About


“Copy/Paste is the mother of learning.”
“Repetition! Repetition is the mother of learning.”

Sources: GitHub | Google Drive | OneDrive

Environment

Assumption: Working directory has sub-folders named "data", "images", "code", "docs".

R Version

# #R Version
R.version.string
## [1] "R version 4.1.2 (2021-11-01)"

Working Directory

# #Working Directory
getwd()
## [1] "D:/Analytics/xADSM"

Session Info

# #Version information about R, the OS and attached or loaded packages
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252    LC_MONETARY=English_India.1252
## [4] LC_NUMERIC=C                   LC_TIME=English_India.1252    
## 
## attached base packages:
## [1] compiler  grid      stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] readxl_1.3.1         psych_2.1.9          e1071_1.7-9          scales_1.1.1        
##  [5] viridisLite_0.4.0    latex2exp_0.5.0      microbenchmark_1.4.9 ggpmisc_0.4.4       
##  [9] ggpp_0.4.2           qcc_2.7              VIM_6.1.1            colorspace_2.0-2    
## [13] mice_3.13.0          kableExtra_1.3.4     lubridate_1.8.0      Lahman_9.0-0        
## [17] gapminder_0.3.0      nycflights13_1.0.2   gifski_1.4.3-1       data.table_1.14.2   
## [21] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.7          purrr_0.3.4         
## [25] readr_2.1.0          tidyr_1.1.4          tibble_3.1.6         ggplot2_3.3.5       
## [29] conflicted_1.0.4    
## 
## loaded via a namespace (and not attached):
##  [1] ellipsis_0.3.2       class_7.3-19         rstudioapi_0.13      proxy_0.4-26        
##  [5] farver_2.1.0         listenv_0.8.0        MatrixModels_0.5-0   prodlim_2019.11.13  
##  [9] fansi_0.5.0          ranger_0.13.1        xml2_1.3.2           codetools_0.2-18    
## [13] splines_4.1.2        mnormt_2.0.2         cachem_1.0.6         robustbase_0.93-9   
## [17] knitr_1.36           jsonlite_1.7.2       pROC_1.18.0          caret_6.0-90        
## [21] broom_0.7.10         httr_1.4.2           backports_1.3.0      assertthat_0.2.1    
## [25] Matrix_1.3-4         fastmap_1.1.0        htmltools_0.5.2      quantreg_5.86       
## [29] tools_4.1.2          gtable_0.3.0         glue_1.5.0           reshape2_1.4.4      
## [33] Rcpp_1.0.7           carData_3.0-4        cellranger_1.1.0     jquerylib_0.1.4     
## [37] vctrs_0.3.8          svglite_2.0.0        nlme_3.1-153         conquer_1.2.1       
## [41] iterators_1.0.13     lmtest_0.9-39        timeDate_3043.102    xfun_0.28           
## [45] gower_0.2.2          laeken_0.5.2         globals_0.14.0       rvest_1.0.2         
## [49] lifecycle_1.0.1      future_1.23.0        DEoptimR_1.0-9       MASS_7.3-54         
## [53] zoo_1.8-9            ipred_0.9-12         hms_1.1.1            parallel_4.1.2      
## [57] SparseM_1.81         yaml_2.2.1           sass_0.4.0           rpart_4.1-15        
## [61] stringi_1.7.5        foreach_1.5.1        boot_1.3-28          lava_1.6.10         
## [65] matrixStats_0.61.0   rlang_0.4.12         pkgconfig_2.0.3      systemfonts_1.0.3   
## [69] evaluate_0.14        lattice_0.20-45      recipes_0.1.17       tidyselect_1.1.1    
## [73] parallelly_1.29.0    plyr_1.8.6           magrittr_2.0.1       bookdown_0.24       
## [77] R6_2.5.1             generics_0.1.1       DBI_1.1.1            pillar_1.6.4        
## [81] withr_2.4.2          survival_3.2-13      abind_1.4-5          sp_1.4-6            
## [85] nnet_7.3-16          future.apply_1.8.1   crayon_1.4.2         car_3.0-12          
## [89] utf8_1.2.2           tmvnsim_1.0-2        tzdb_0.2.0           rmarkdown_2.11      
## [93] ModelMetrics_1.2.2.2 vcd_1.4-9            digest_0.6.28        webshot_0.5.2       
## [97] stats4_4.1.2         munsell_0.5.0        bslib_0.3.1

Pandoc

# #Pandoc Version being used by RStudio
rmarkdown::pandoc_version()
## [1] '2.14.0.3'

Aside

I wanted a single document containing notes, code, and output as a quick reference for the lectures. A combination of multiple file formats (docx, csv, xlsx, R, png, etc.) was not working out for me. So, I used the Bookdown package to generate this HTML file.

All of us had to stumble through some of the most common problems individually, and as we approach deeper topics, a more collaborative approach might be beneficial.

Further, the lectures are highly focused, so I had to explore some side topics in more detail to get the most benefit from them. I have included those topics, and I am interested in hearing about your experiences too.

Towards that goal, I am sharing these notes, hoping that you will run the code in your own environment and raise any queries, problems, or differences in outcomes. Any suggestion or criticism is welcome. I have tried not to make any significant changes to your working environment; please let me know if you observe otherwise.

Currently, my priority is to get in sync with the ongoing lectures. The time constraint has led to the issues listed below, which will be corrected as and when possible.

  • The tone of the document may be a little abrupt; please overlook that.
  • Source references are not added as much as I wanted (noting where I copied/learned from), and there is no easy solution for this.
  • I have NOT explained some functions before their usage (lapply(), identical(), etc.). Hyperlinks will be added as and when those topics are covered.
  • Code has been checked only on Windows 10. On Mac or Linux, if you find different output or behaviour, please let me know.
  • Although these notes are generated using R Markdown and Bookdown, I have not yet covered these topics. If you need any help creating your own notes, please let me know; if I have a solution for your problem, I will share it.

Last, but not least, I am also learning while creating this, so if you think I am wrong somewhere, please point it out. I am always open to suggestions.

Thank You all for the encouragement.

Shivam


(B01)


(B02)


(B03)


(B04)


(B05)


(B06)


(B07)


(B08)


1 R Introduction (B09, Aug-31)

1.1 R Basics

R is case-sensitive, e.g. c() not C(), and View() not view().

The hash sign “#” comments out everything after it until the end of the line. There are no multi-line comments.

The backslash “\” is reserved to escape the character that follows it.

The Escape key interrupts the parser, e.g. when the console shows a “+” prompt because R is waiting for more input before evaluating.
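A quick sketch of these basics (comments, case sensitivity, and backslash escaping):

```r
# Anything after '#' is a comment, until the end of the line
x <- c(1, 2, 3)   # c() combines values; C() is a different function
#
# Backslash escapes the next character: "\n" is a newline, "\\" a single backslash
cat("Line 1\nLine 2\n")
cat("C:\\Users\\userName\n")  # prints C:\Users\userName
```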

Overview

1.1.1 R Studio

  • There are 4 Panes -
    1. Top Left - R Editor, Source
    2. Bottom Left - Console, Terminal, …
    3. Top Right - Environment, History, …
    4. Bottom Right - Plots, …
  • Sometimes there are only 3 panes i.e. Editor Pane is missing
    • To Open Editor Pane - Create a New R Script by “File | New File | R Script” or "Ctrl+ Shift+ N"
  • To Modify Pane Settings - Tools | Global Options | Pane Layout

1.1.2 Shortcuts

  • Execute the current expression in Source Pane (Top): “Ctrl+ Enter”
  • Execute the current expression in Console Pane (Bottom): “Enter”
  • Clear the Console Pane (Bottom): “Ctrl+ L”
  • Restart the Current R Session: “Ctrl+ Shift+ F10”
  • Create a New R Script: “Ctrl+ Shift+ N”
  • Insert ” <- ” i.e. Assignment Operator with Space: “Alt+ -”
  • Insert ” %>% ” i.e. Pipe Operator with Space: “Ctrl+ Shift+ M”
  • Comment or Uncomment Lines: “Ctrl+ Shift+ C”
  • Set Working Directory: “Ctrl+ Shift+ H”
  • Search Command History: “Ctrl+ Up Arrow”
  • Search Files: “Ctrl+ .”

1.1.3 Executing an Expression

Execute the current expression in Source Pane (Top) by ‘Run’ Button or "Ctrl+ Enter"

Execute the current expression in Console Pane (Bottom) by “Enter”

1.1.4 PATH and Working Directory

Windows 10 uses backslash “\” for PATH. R, however, uses slash “/.” Backslash “\” is escape character in R.

  • So, to provide “C:\Users\userName\Documents” as PATH
    • Use: “C:\\Users\\userName\\Documents”
    • OR: “C:/Users/userName/Documents”
    • OR: “~” Tilde acts as a Reference to Home Directory

In R Studio, Set Working Directory by:

  • Session | Set Working Directory | Choose Directory or "Ctrl+ Shift+ H"
# #Current Working Directory
getwd()
## [1] "D:/Analytics/xADSM"
#
# #R Installation Directory (Old DOS Convention i.e. ~1 after 6 letters)
R.home()
## [1] "C:/PROGRA~1/R/R-41~1.2"
Sys.getenv("R_HOME") 
## [1] "C:/PROGRA~1/R/R-41~1.2"
#
# #This is Wrapped in IF Block to prevent accidental execution
if(FALSE){
# #WARNING: This will change your Working Directory
  setwd("~")
}

1.1.5 Printing

If R code is entered at the console, line by line, then the output is printed automatically, i.e. no function is needed for printing. This is called implicit printing.

Inside an R script file, implicit printing does not work and expressions need to be printed explicitly.

In R, the most common method to print the output ‘explicitly’ is by the function print().

# #Implicit Printing: This will NOT be printed to Console, if it is inside an R Script.
"Hello World!"
#
# #Implicit Printing using '()': Same as above
("Hello World!")
#
# #Explicit Printing using print() : To print Objects to Console, even inside an R Script.
print("Hello World!")
## [1] "Hello World!"

1.2 Objects

1.2.1 List ALL Objects

Everything that exists in R is an object in the sense that it is a kind of data structure that can be manipulated. Expressions for evaluation are themselves objects; Evaluation consists of taking the object representing an expression and returning the object that is the value of that expression.

# #ls(): List ALL Objects in the Current NameSpace (Environment)
ls()
## character(0)

1.2.2 Assign a Value to an Object

Caution: Always use “<-” for the assignment, NOT the “=”

While “=” can be used for assignment, its usage for assignment is highly discouraged because it may behave differently under certain subtle conditions that are difficult to debug. The convention is to use “=” only during function calls for argument association (as a syntactic token).

There are 5 assignment operators (<-, =, <<-, ->, ->>), others are not going to be discussed for now.
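A minimal sketch of the difference inside a function call (assuming no object named ‘x’ exists yet):

```r
# '=' inside a function call only names an argument; it creates no object
median(x = 1:5)    # 3; no 'x' appears in the workspace
#
# '<-' inside a function call assigns as a side effect
median(x <- 1:5)   # 3; but now 'x' DOES exist in the workspace
exists("x")        # TRUE
```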

All the created objects are listed in the Environment Tab of the Top Right Pane.

# #Assignment Operator "<-" is used to assign any value (ex: 10) to any object (ex: 'bb')
bb <- 10
#
# #Print Object
print(bb)
## [1] 10

1.2.3 Remove an Object

In the Environment Tab, any object can be selected and deleted using the broom icon.

# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] 10
#
# #Remove an Object
rm(bb)
#
# #Equivalent
if(FALSE) {rm("bb")} #Same
if(FALSE) {rm(list = "bb")} #Faster, verbose, and would not work without quotes
#
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] "Error in print(bb): object 'bb' not found\n"

1.3 Data

6.1 Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.

6.2 Elements are the entities on which data are collected. (Generally ROWS)

6.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)

6.4 The set of measurements obtained for a particular element is called an observation.

6.5 Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

1.4 Vectors

R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form Data Structures (vector, list, matrix, dataframe, factor etc.). Refer What is a Vector!

Definition 1.1 Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
Definition 1.2 Members of a vector are called components.

Atomic vectors are homogeneous i.e. each component has the same datatype. A vector type can be checked with the typeof() or class() function. Its length, i.e. the number of elements in the vector, can be checked with the function length().

If the output of an expression does not show index numbers in brackets like ‘[1],’ then the expression returned NULL; the bracketed numbers indicate that the result is a vector. E.g. str() and cat() return NULL (invisibly).
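For example, cat() prints to the console but returns NULL invisibly, while print() returns its (vector) argument, hence the ‘[1]’ prefix:

```r
out <- cat("Hello\n")    # prints Hello, but...
is.null(out)             # TRUE: cat() returned NULL
#
val <- print("Hello")    # prints [1] "Hello"
identical(val, "Hello")  # TRUE: print() returned the vector
```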

Use function c() to create a vector (or a list) -

  • In R, a literal character or number is just a vector of length 1. So, c() ‘combines’ a series of length-1 vectors together.
  • c() neither creates nor concatenates the vectors, it combines them. Thus, it combines lists into a list and vectors into a vector.
  • In R, a list is a ‘vector’ but not an ‘atomic vector.’
  • All arguments are coerced to a common type which is the type of the returned value.
  • All attributes (e.g. dim) except ‘names’ are removed.
  • The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.
  • To “index a vector” means to address specific elements by using square brackets, i.e. x[10] means the 10th element of vector ‘x.’
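The coercion hierarchy can be verified directly with typeof():

```r
typeof(c(TRUE, 1L))    # "integer"  : logical is promoted to integer
typeof(c(1L, 2.5))     # "double"   : integer is promoted to double
typeof(c(1, "a"))      # "character": everything becomes character
typeof(c(1, list(2)))  # "list"     : a list absorbs atomic vectors
```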

Caution: Colon “:” might produce unexpected length of vectors (in case of 0-length vectors). Suggestion: Use colon only with hardcoded numbers i.e. “1:10” is ok, “1:n” is dangerous and should be avoided.

Caution: seq() function might produce unexpected type of vectors (in case of 1-length vectors). Suggestion: Use seq_along(), seq_len().
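A short demonstration of the 0-length trap and the safe alternatives:

```r
# When n is 0, 1:n counts DOWN instead of producing an empty vector
n <- 0L
1:n           # [1] 1 0  -- length 2, not 0!
seq_len(n)    # integer(0) -- the safe empty sequence
#
# seq_along() safely indexes an existing vector, even a 0-length one
x <- c("a", "b", "c")
seq_along(x)             # [1] 1 2 3
seq_along(character(0))  # integer(0)
```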

Atomic Vectors

# #To know about an Object: str(), class(), length(), dim(), typeof(), is(), attributes(), names()
# #Integer: To declare as integer "L" (NOT "l") is needed
ii_int <- c(1L, 2L, 3L, 4L, 5L)
str(ii_int)
##  int [1:5] 1 2 3 4 5
#
# #Double (& Default)
dd_dbl <- c(1, 2, 3, 4, 5)
str(dd_dbl)
##  num [1:5] 1 2 3 4 5
#
# #Character
cc_chr <- c('a', 'b', 'c', 'd', 'e')
str(cc_chr)
##  chr [1:5] "a" "b" "c" "d" "e"
#
# #Logical
ll_lgl <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
str(ll_lgl)
##  logi [1:5] TRUE FALSE FALSE TRUE TRUE

Integer

# #Integer Vector of Length 1
nn <- 5L
#
# #Colon ":" Operator - Avoid its usage
str(c(1:nn))
##  int [1:5] 1 2 3 4 5
c(typeof(pi:6), typeof(6:pi))
## [1] "double"  "integer"
#
# #seq() - Avoid its usage
str(seq(1, nn))
##  int [1:5] 1 2 3 4 5
str(seq(1, nn, 1))
##  num [1:5] 1 2 3 4 5
str(seq(1, nn, 1L))
##  num [1:5] 1 2 3 4 5
str(seq(1L, nn, 1L))
##  int [1:5] 1 2 3 4 5
#
# #seq_len()
str(seq_len(nn))
##  int [1:5] 1 2 3 4 5

Double

str(seq(1, 5, 1))
##  num [1:5] 1 2 3 4 5

Character

str(letters[1:5])
##  chr [1:5] "a" "b" "c" "d" "e"

Logical

str(1:5 %% 2 == 0)
##  logi [1:5] FALSE TRUE FALSE TRUE FALSE

1.5 DataFrame

# #Create Two Vectors
income <- c(100, 200, 300, 400, 500)
gender <- c("male", "female", "female", "female", "male")
#
# #Create a DataFrame
bb <- data.frame(income, gender)
#
# #Print or View DataFrame
#View(bb)
print(bb)
##   income gender
## 1    100   male
## 2    200 female
## 3    300 female
## 4    400 female
## 5    500   male
#
# #Structure
str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
#
# #Names
names(bb)
## [1] "income" "gender"

1.6 Save and Load an R Script

R Script file extension is “.R”

"Ctrl+ S" will Open Save Window at Working Directory.

"Ctrl+ O" will Open the Browse Window at Working Directory.

Check File Exist

# #Subdirectory "data" has data files like .csv .rds .txt .xlsx
# #Subdirectory "code" has scripts files like .R 
# #Subdirectory "images" has images like .png
#
# #Check if a File exists 
path_relative <- "data/aa.xlsx" #Relative Path
#
if(file.exists(path_relative)) {
    cat("File Exists\n") 
  } else {
    cat(paste0("File does not exist at ", getwd(), "/", path_relative, "\n"))
  }
## File Exists
#
if(exists("XL", envir = .z)) {
  cat(paste0("Absolute Path exists as: ", .z$XL, "\n"))
  path_absolute <- paste0(.z$XL, "aa", ".xlsx") #Absolute Path
  #
  if(file.exists(path_absolute)) {
    cat("File Exists\n") 
  } else {
    cat(paste0("File does not exist at ", path_absolute, "\n"))
  }
} else {
  cat(paste0("Object 'XL' inside Hidden Environment '.z' does not exist. \n", 
             "It is probably File Path of the Author, Replace the File Path from Your own Directory\n"))
}
## Absolute Path exists as: D:/Analytics/xADSM/data/
## File Exists

Aside

  • This section is NOT useful for the general reader and can be safely ignored. It contains my notes related to building this book, useful only for someone who is building their own book. (Shivam)
  • “Absolute Path” is NOT a problem in Building a Book, Knitting a Chapter, or on Direct Console.
  • “Absolute Path” is a problem only when running a code chunk directly from the Rmd document; when the Rmd document is inside a sub-directory (as in this book), the working directory differs.

1.7 CSV Import /Export

write.csv() and read.csv() combination can be used to export data and import it back into R. But, it has some limitations -

  • Re-imported object “yy_data” will NOT match the original object “xx_data” under default conditions
    1. write.csv(), by default, writes row.names (or row numbers) into the first column.
      • So, either use row.names = FALSE while writing
      • OR use row.names = 1 while reading
    2. The row.names attribute is always read back as ‘character’ even though originally it might be ‘integer.’
      • So, that attribute needs to be coerced
    3. colClasses needs to be defined to match the original dataframe; otherwise ‘income’ is read as ‘integer,’ even though originally it was ‘numeric.’
    4. Conclusion: Avoid, if possible.
  • Alternative: saveRDS() and readRDS()
    • Functions to write a single R object to a file, and to restore it.
    • Imported /Exported objects are always identical
ERROR 1.1 Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
  • Check the path, file name, & file extension for typing mistakes
  • Execute getwd(), just before the command, to confirm that the working directory is as expected

write.csv()

xx_data <- bb
#
# #Write a dataframe to a CSV File
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from the CSV into a dataframe
yy_data <- read.csv("data/B09_xx_data.csv")
#
# #Check if the object being read back is the same as the object that was written
identical(xx_data, yy_data)
## [1] FALSE

Match Objects

# #Exercise to show how to match the objects being imported /exported from CSV
str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
xx_data <- bb
# #Write to CSV
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from CSV by providing row.names Column and colClasses()
yy_data <- read.csv("data/B09_xx_data.csv", row.names = 1,
                    colClasses = c('character', 'numeric', 'character'))
#
# #Coerce row.names attribute to integer
attr(yy_data, "row.names") <- as.integer(attr(yy_data, "row.names"))
#
# #Check if the objects are identical
identical(xx_data, yy_data)
## [1] TRUE
stopifnot(identical(xx_data, yy_data))

RDS

str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
xx_data <- bb
#
# #Save the Object as RDS File
saveRDS(xx_data, "data/B09_xx_data.rds")
#
# #Read from the RDS File
yy_data <- readRDS("data/B09_xx_data.rds")
#
# #Objects are identical (No additional transformations are needed)
identical(xx_data, yy_data)
## [1] TRUE

1.8 Modify Dataframe

str(xx_data)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
# #Adding a Column to a dataframe
xx_data <- data.frame(xx_data, age = 22:26)
#
# #Adding a Column to a dataframe by adding a Vector
x_age <- 22:26
xx_data <- data.frame(xx_data, x_age)
str(xx_data)
## 'data.frame':    5 obs. of  4 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ age   : int  22 23 24 25 26
##  $ x_age : int  22 23 24 25 26
#
# #Adding a Column to a dataframe by using dollar "$"
xx_data$age1 <- x_age
#
# #Adding a Blank Column using NA
xx_data$blank <- NA
#
# #Editing of a dataframe can also be done
# edit(xx_data)
str(xx_data)
## 'data.frame':    5 obs. of  6 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ age   : int  22 23 24 25 26
##  $ x_age : int  22 23 24 25 26
##  $ age1  : int  22 23 24 25 26
##  $ blank : logi  NA NA NA NA NA
#
# #Removing a Column by subsetting
xx_data <- xx_data[ , -c(3)]
#
# #Removing a Column using NULL
xx_data$age1 <- NULL
str(xx_data)
## 'data.frame':    5 obs. of  4 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ x_age : int  22 23 24 25 26
##  $ blank : logi  NA NA NA NA NA

1.9 Packages

Definition 1.3 Packages are the fundamental units of reproducible R code.

Packages include reusable functions, the documentation that describes how to use them, and sample data.

In R Studio: Packages Tab | Install | Package Name = “psych” | Install

  • Packages are installed from CRAN Servers
    • To Change Server: Tools | Global Options | Packages | Primary CRAN Repository | Change | CRAN Mirrors (Select Your Preference) | OK
    • All Installed Packages are listed under Packages Tab
    • All Loaded Packages are listed under Packages Tab with a Tick Mark
    • Some packages are dependent on other packages and those are also installed when ‘dependencies = TRUE’
    • If a package is NOT installed properly, it will show error when loaded by library()
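One defensive pattern is to test for the package before loading it; requireNamespace() is base R and returns FALSE (instead of throwing an error) when the package is missing:

```r
# Check availability first to avoid the library() error on a missing/broken install
if (requireNamespace("psych", quietly = TRUE)) {
  library(psych)
} else {
  message("Package 'psych' is not installed")
}
```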

Install Packages

if(FALSE){
  # #WARNING: This will install packages and R Studio will NOT work for that duration
  # #Install Packages and their dependencies
  install.packages("psych", dependencies = TRUE)
}

Load Packages

# #Load a Package with or without Quotes
library(readxl)
library("readr")

Load Multiple Packages

# #Load Multiple Packages
pkg_chr <- c("ggplot2", "tibble", "tidyr", "readr", "dplyr")
#lapply(pkg_chr, FUN = function(x) {library(x, character.only = TRUE)})
#
# #Load Multiple Packages, Suppress Startup Messages, and No console output
invisible(lapply(pkg_chr, FUN = function(x) {
  suppressMessages(library(x, character.only = TRUE))}))

Detach Package

# #Detach a package
#detach("package:psych", unload = TRUE)
#
# #Search Package in the already loaded packages
pkg_chr <- "psych"
if (pkg_chr %in% .packages()) {
# #Detach a package that has been loaded previously
  detach(paste0("package:", pkg_chr), character.only = TRUE, unload = TRUE)
}

1.10 Import Flights Data

To Import Excel in R Studio : Environment | Dropdown | From Excel | Browse

The object imported by read.csv(), i.e. ‘mydata,’ is NOT the same as the one imported by read_excel(), i.e. ‘mydata_xl’ -

  • read_excel() imports as a Tibble which is a modern view of dataframe. It is more restrictive so that output would be more predictable.
  • read.csv() imports as integer where possible (e.g. the ‘year’ column), while read_excel() imports as numeric where possible.
  • Further, read_excel() has imported many columns as ‘character’ that should have been ‘numeric,’ e.g. dep_time
  • NOTE: To complete the set, readr::read_csv() is also covered here; it reads a CSV and generates a tibble.

All of these objects can be converted into any other form as needed, e.g. dataframe to tibble or vice versa.
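A sketch of that round trip (assuming the tibble package is installed; the column names here are illustrative):

```r
# #library(tibble)
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
tb <- tibble::as_tibble(df)   # dataframe -> tibble
class(tb)                     # "tbl_df" "tbl" "data.frame"
df2 <- as.data.frame(tb)      # tibble -> dataframe
class(df2)                    # "data.frame"
```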

Flights

# #Data File Name has been modified to include lecture number "B09"
# #All Data Files are in the sub-directory named 'data'
mydata <- read.csv("data/B09-FLIGHTS.csv")
#
# #To Copy from Clipboard, assuming copied from xlsx i.e. tab separated data
mydata_clip <- read.csv("clipboard", sep = '\t', header = TRUE)

RDS

# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is faster loading time and lower memory requirement
xx_csv <- paste0("data/", "B09-FLIGHTS",".csv")
xx_rds <- paste0("data/", "b09_flights", ".rds")
b09_flights <- NULL
if(file.exists(xx_rds)) {
  b09_flights <- readRDS(xx_rds)
} else {
  # #Read CSV
  b09_flights <- read.csv(xx_csv)
  # #Write Object as RDS
  saveRDS(b09_flights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata <- b09_flights

Structure

str(mydata)
## 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour          : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : chr  "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

Tail

tail(mydata)
##        year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## 336771 2013     9  30       NA           1842        NA       NA           2019        NA      EV
## 336772 2013     9  30       NA           1455        NA       NA           1634        NA      9E
## 336773 2013     9  30       NA           2200        NA       NA           2312        NA      9E
## 336774 2013     9  30       NA           1210        NA       NA           1330        NA      MQ
## 336775 2013     9  30       NA           1159        NA       NA           1344        NA      MQ
## 336776 2013     9  30       NA            840        NA       NA           1020        NA      MQ
##        flight tailnum origin dest air_time distance hour minute           time_hour
## 336771   5274  N740EV    LGA  BNA       NA      764   18     42 2013-09-30 18:00:00
## 336772   3393    <NA>    JFK  DCA       NA      213   14     55 2013-09-30 14:00:00
## 336773   3525    <NA>    LGA  SYR       NA      198   22      0 2013-09-30 22:00:00
## 336774   3461  N535MQ    LGA  BNA       NA      764   12     10 2013-09-30 12:00:00
## 336775   3572  N511MQ    LGA  CLE       NA      419   11     59 2013-09-30 11:00:00
## 336776   3531  N839MQ    LGA  RDU       NA      431    8     40 2013-09-30 08:00:00

Excel

# #library(readxl)
mydata_xl <- read_excel("data/B09-FLIGHTS.xlsx", sheet = "FLIGHTS")

Excel RDS

# #library(readxl)
xx_xl <- paste0("data/", "B09-FLIGHTS",".xlsx")
xx_rds_xl <- paste0("data/", "b09_flights_xls", ".rds")
b09_flights_xls <- NULL
if(file.exists(xx_rds_xl)) {
  b09_flights_xls <- readRDS(xx_rds_xl)
} else {
  b09_flights_xls <- read_excel(xx_xl, sheet = "FLIGHTS")
  saveRDS(b09_flights_xls, xx_rds_xl)
}
rm(xx_xl, xx_rds_xl)
mydata_xl <- b09_flights_xls
#

xlsx

str(mydata_xl)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : num [1:336776] 2013 2013 2013 2013 2013 ...
##  $ month         : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : chr [1:336776] "517" "533" "542" "544" ...
##  $ sched_dep_time: num [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : chr [1:336776] "2" "4" "2" "-1" ...
##  $ arr_time      : chr [1:336776] "830" "850" "923" "1004" ...
##  $ sched_arr_time: num [1:336776] 819 830 850 1022 837 ...
##  $ arr_delay     : chr [1:336776] "11" "20" "33" "-18" ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : num [1:336776] 1545 1714 1141 725 461 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : chr [1:336776] "227" "227" "160" "183" ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

readr

# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is faster loading time and lower memory requirement
# #library(readr)
xx_csv <- paste0("data/", "B09-FLIGHTS",".csv")
xx_rds <- paste0("data/", "xxflights", ".rds")
xxflights <- NULL
if(file.exists(xx_rds)) {
  xxflights <- readRDS(xx_rds)
} else {
  xxflights <- read_csv(xx_csv, show_col_types = FALSE)
  attr(xxflights, "spec") <- NULL
  attr(xxflights, "problems") <- NULL
  saveRDS(xxflights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata_rdr <- xxflights

1.11 Subsetting

# #Subset All Rows and last 3 columns
data6 <- mydata[ , c(17:19)]
str(data6)
## 'data.frame':    336776 obs. of  3 variables:
##  $ hour     : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute   : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour: chr  "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Subset by deleting the 1:16 columns
data7 <- mydata[ , -c(1:16)]
stopifnot(identical(data6, data7))
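The same subset can also be taken by column names, which is more robust if the column order ever changes (a sketch, reusing ‘mydata’ and ‘data6’ from above):

```r
# #Subset by column names instead of positions
data8 <- mydata[ , c("hour", "minute", "time_hour")]
stopifnot(identical(data6, data8))
```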

1.12 Attach a Dataset

Caution: Attaching a dataset should be avoided to prevent unexpected behaviour due to ‘masking.’ Using full scope resolution, i.e. ‘data_frame$column_header,’ results in fewer bugs. However, if a dataset has been attached, please ensure that it is detached afterwards.

if(FALSE){
  # #WARNING: Attaching a Dataset is discouraged because of 'masking'
  # #'dep_time' is Column Header of a dataframe 'mydata'
  tryCatch(str(dep_time), error = function(e) print(paste0(e)))
## [1] "Error in str(dep_time): object 'dep_time' not found\n"
  # #Attach the Dataset
  attach(mydata)
  # #Now all the column headers are accessible without the $ sign
  str(dep_time)
## int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
  # #But, there are other datasets also, attaching another one results in MESSAGE
  attach(mydata_xl)
## The following objects are masked from mydata:
##
##     air_time, arr_delay, arr_time, carrier, day, dep_delay, dep_time, dest,
##     distance, flight, hour, minute, month, origin, sched_arr_time,
##     sched_dep_time, tailnum, time_hour, year
  str(dep_time)
## chr [1:336776] "517" "533" "542" "544" "554" "554" "555" "557" "557" "558" "558" ...
#
# #'mydata_xl$dep_time' masked the already present 'mydata$dep_time'.
# #Thus now it is showing as 'chr' in place of original 'int'
# #Column header names can be highly varied and will silently mask other variables
# #Hence, attaching a dataset would result in random bugs or unexpected behaviours
#
# #Detach a Dataset
  detach(mydata_xl)
  detach(mydata)
}
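As an alternative to attach(), with() evaluates an expression with temporary, scoped access to the columns, leaving no masking behind (a sketch, assuming ‘mydata’ from above is loaded):

```r
# #Scoped access to a column, without attaching the dataset
with(mydata, str(dep_time))
#
# #Equivalent full scope resolution
str(mydata$dep_time)
```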

1.13 Package “psych”

  • pairs.panels() -
    • It shows a scatter plot of matrices, with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
      • Calculation time is highly dependent on dataset size and type
      • See Figure 1.1
      • Conclusion: “air_time and distance are highly correlated”
ERROR 1.2 Error in plot.window(...) : need finite ’xlim’ values
  • In this case, the error is observed if the output of pairs.panels() is assigned to an object.
  • Direct console output (i.e. no assignment) should not be a problem
ERROR 1.3 Error in par(old.par) : invalid value specified for graphical parameter "pin"
  • This error is generally observed because the plot does not have enough space in the RStudio Plots pane (lower-right). In general, it is NOT a problem with the code itself.
  • Use a larger window size or control the image size of the output

Image


Figure 1.1 Correlation using psych::pairs.panels()

Code

# # Subset 3 Columns and 1,00,000 rows 
x_rows <- 100000L
data_pairs <- mydata[1:x_rows, c(7, 16, 9)]
#
# #Equivalent
data_pairs <- mydata  %>%
  select(air_time, distance, arr_delay) %>%
  slice_head(n = x_rows)
#
if( nrow(data_pairs) * ncol(data_pairs) > 1000000 ) {
  print("Please reduce the number of points to a sane number!")
  ggplot()
} else {
  #B09P01
  pairs.panels(data_pairs)
}

Validation


2 R Introduction (B10, Sep-05)

2.2 Notebooks

These allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.

Definition 2.1 R Markdown is a file format for making dynamic documents with R.

An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code. To know more Go to Rstudio

To know more about Google Colab Go to Google Colab

NOTE: As I am not using Google Colab, the workflow explained between 00:00 to 35:10 is NOT covered here. If someone is using Google Colab, and is willing to share their notes, I would include those.

2.3 Plot

Base R graphs/plots are shown in Figure 2.1

ERROR 2.1 Error in plot.window(...) : need finite ’xlim’ values
  • This error can occur when the base R plot() function is called
  • Check whether the data has NA values or whether character data is supplied where numerical data is needed
  • Also, do not assign a base R plot to an object and then print it

Figure 2.1 Flights: Arrival Time (Y) vs. Departure (X) Time
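The checks above can be sketched with a toy vector (illustrative only, not the flights data): an all-NA input reproduces the error, and dropping NA beforehand avoids it.

```r
# #Sketch: reproducing and avoiding the finite 'xlim' error
x <- c(NA_real_, NA_real_)
# #plot() cannot derive axis limits from all-NA input
res <- tryCatch(plot(x, x), error = function(e) conditionMessage(e))
print(res)
## [1] "need finite 'xlim' values"
# #Guard: drop NA before plotting
y <- c(1, NA, 3)
ok <- !is.na(y)
plot(y[ok], y[ok])
```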

2.4 Dataset

  • Use cbind() or rbind() to merge dataframes

Dimensions

# #Create a Subset of Dataframe of 1000 Rows for quick calculations
bb <- head(mydata, 1000)
#
# #Dimensions: dim() Row x Column; nrow(); ncol()
dim(bb)
## [1] 1000   19
#
stopifnot(identical(nrow(bb), dim(bb)[1]))
stopifnot(identical(ncol(bb), dim(bb)[2]))

Split

# #Split a Dataframe by subsetting
data_1 <- bb[ ,1:8]
data_2 <- bb[ ,9:19]
# str(data_1)

Merge

# #Merge a Dataframe by cbind()
data_3 <- cbind(data_1, data_2)
# #Equivalent
data_4 <- data.frame(data_1, data_2)
# str(bb_3)
stopifnot(identical(data_3, data_4))

RowSplit

# #Row Split
data_5 <- bb[1:300, ]
data_6 <- bb[301:1000, ]
#
# #Equivalent
n_rows <- 300L
data_5 <- bb[1:n_rows, ]
data_6 <- bb[(n_rows+1L):nrow(bb), ]
#
stopifnot(identical(data_5, head(bb, n_rows)))
stopifnot(identical(data_6, tail(bb, (nrow(bb)-n_rows))))

RowMerge

# #Merge a Dataframe by rbind()
data_7 <- rbind(data_5, data_6)
stopifnot(identical(bb, data_7))

2.5 Change Column Headers

# #Change A Specific Name based on Index Ex: First Header "year" -> "YEAR"
# #NOTE: Output of 'names(bb)' is a character vector, not a dataframe
# #So, [1] is being used to subset for 1st element and NOT the [ , 1] (as done for dataframe)
(names(bb)[1] <- "YEAR")
## [1] "YEAR"
#
# #Change all Column Headers to Uppercase by toupper() or Lowercase by tolower()
names(bb) <- toupper(names(bb))

2.6 NA

Definition 2.2 NA is a logical constant of length 1 which contains a missing value indicator.

NA can be coerced to any other vector type except raw. There are also typed constants like NA_integer_, NA_real_ etc. For checking only the presence of NA, anyNA(x) is faster than any(is.na(x))
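A small sketch of the typed NA constants (toy values):

```r
# #NA is logical by default; typed variants exist for other atomic types
typeof(NA)
## [1] "logical"
typeof(NA_integer_)
## [1] "integer"
typeof(NA_real_)
## [1] "double"
typeof(NA_character_)
## [1] "character"
# #anyNA(x) is a shortcut for any(is.na(x))
anyNA(c(1, NA, 3))
## [1] TRUE
```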

Overview of ‘Not Available’

  • If the imported data has blank cells, they are imported as NA

To remove all NA

  • na.omit()
    • Output is a dataframe
    • It is slower but adds the omitted row numbers as an attribute i.e. na.action
  • complete.cases()
    • Output is a logical vector, thus it needs subsetting to get the dataframe
    • Faster and also allows partial selection of columns i.e. ignore NA in other columns
    • Caution: It may throw Error if ‘POSIXlt’ Columns are present
  • tidyr::drop_na()
  • rowSums(is.na())
    • It can also be used to exclude rows with more than an allowed number of NA. However, in general, this is not recommended because random columns retain NA, which may break the code later or change the number of observations. It is useful when all columns are similar in nature, e.g. when each column represents the response to a survey question.

NA

bb <- xxflights
# #anyNA() is faster than is.na()
if(anyNA(bb)) print("NA are Present!") else print("NA not found")
## [1] "NA are Present!"
#
# #Columnwise NA Count
bb_na_col <- colSums(is.na(bb))
#
# #Vector of Columns having NA
which(bb_na_col != 0)
##  dep_time dep_delay  arr_time arr_delay   tailnum  air_time 
##         4         6         7         9        12        15
stopifnot(identical(which(bb_na_col != 0), which(vapply(bb, anyNA, logical(1)))))
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 472 478 616 644 726 734
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 9430
#
# #How many rows have NA in specific Columns
sum(!complete.cases(bb[, c(6, 9, 4)]))
## [1] 9430

RemoveNA

# #Remove all rows which have any NA 
# #na.omit(), complete.cases(), tidyr::drop_na(), rowSums(is.na())
bb_1 <- na.omit(bb)
# #Print the Count of removed rows containing NA
print(paste0("Note: ", length(attributes(bb_1)$na.action), " rows removed."))
## [1] "Note: 9430 rows removed."
#
# #Remove additional Attribute added by na.omit()
attr(bb_1, "na.action") <- NULL
#
# #Equivalent 
bb_2 <- bb[complete.cases(bb), ]
bb_3 <- bb %>% drop_na()
bb_4 <- bb[rowSums(is.na(bb)) == 0, ]
#Validation
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
#
# #complete.cases also allow partial selection of specific columns
# #Remove rows which have NA in some columns i.e. ignore NA in other columns
dim(bb[complete.cases(bb[ , c(6, 9, 4)]), ])
## [1] 327346     19
# #Equivalent 
dim(bb %>% drop_na(dep_delay, arr_delay, dep_time))
## [1] 327346     19
#
# #Remove rows which have more than allowed number of NA (ex:4) in any column
# #Caution: In general, this is not recommended because random columns retain NA
dim(bb[rowSums(is.na(bb)) <= 4L, ])
## [1] 328521     19

2.7 Apply

Sources: Grouping Functions and the Apply Family, Why is vapply safer than sapply, Hadley - Advanced R - Functionals, This, This, & This

Apply functions in R are designed to avoid the explicit use of loop constructs.

  • To manipulate slices of data in a repetitive way.
  • They act on an input list, matrix or array, and apply a named function with one or several optional arguments.
  1. apply(X, MARGIN, FUN, ..., simplify = TRUE)
    • Refer R Manual p72 - “Apply Functions Over Array Margins”
    • Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
    • When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first
    • MARGIN = 1 indicates application over ROWS, 2 indicates COLUMNS
    • Examples & Details: “ForLater”
  2. lapply(X, FUN, ...)
    • Refer R Manual p342 - “Apply a Function over a List or Vector”
    • ‘list’ apply i.e. lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
    • Examples & Details: “ForLater”
    • When you want to apply a function to each element of a list in turn and get a list back.
    • lapply(x, mean)
    • lapply(x, function(x) c(mean(x), sd(x)))
  3. sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
    • ‘simplified’ wrapper of lapply
    • When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
    • Caution: It sometimes fails silently or unexpectedly changes output type
  4. vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
    • ‘verified’ apply i.e. vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
    • vapply returns a vector or array of type matching the FUN.VALUE.
    • With FUN.VALUE you can specify the type and length of the output that should be returned each time your applied function is called.
    • It improves consistency by providing limited return type checks.
    • Further, if the input length is zero, sapply will always return an empty list, regardless of the input type (Thus behaving differently from non-zero length input). Whereas, with vapply, you are guaranteed to have a particular type of output, so you do not need to write extra checks for zero length inputs.
  5. Others - “ForLater”
    • tapply is a tagged apply where the tags identify the subsets
    • mapply for applying a function to multiple arguments
    • rapply for a ‘recursive’ version of lapply
    • eapply for applying a function to each entry in an ‘environment’
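The sapply() vs. vapply() cautions above can be sketched with toy inputs (not the flights data):

```r
# #sapply() simplifies unpredictably; vapply() enforces declared type/length
x <- list(a = 1:3, b = 4:6)
# #range() returns 2 values, so sapply() silently returns a 2-row matrix
sapply(x, range)
# #Zero-length input: sapply() gives an empty list, vapply() keeps the type
sapply(list(), is.numeric)
## list()
vapply(list(), is.numeric, logical(1))
## logical(0)
# #vapply() fails fast if FUN output does not match FUN.VALUE
tryCatch(vapply(x, range, numeric(1)),
         error = function(e) print("FUN.VALUE mismatch"))
## [1] "FUN.VALUE mismatch"
```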
# #Subset Dataframe 
bb <- xxflights
data_8 <- bb[ , c("dep_delay", "arr_delay", "dep_time")]
#data_8 <- bb %>% select(dep_delay, arr_delay, dep_time) 
#
# #Remove missing values
data_9 <- na.omit(data_8)
#
# #Calculate Columnwise Mean
(bb_1 <- apply(data_9, 2, mean))
##   dep_delay   arr_delay    dep_time 
##   12.555156    6.895377 1348.789883
bb_2 <- unlist(lapply(data_9, mean))
bb_3 <- sapply(data_9, mean)
bb_4 <- vapply(data_9, mean, numeric(1))
#
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))

2.8 Vectors

Refer The 6 Datatypes of Atomic Vectors

Create a Basic Tibble, Table2.1, for evaluating ‘is.x()’ series of functions in Base R

  • anyNA() is TRUE if there is an NA present, FALSE otherwise
  • is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
  • is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
    • Caution: With vapply() it returns TRUE for matrix (it checks individual elements)
    • Caution: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
  • is.numeric() is TRUE for both integer and double
  • is.integer(), is.double(), is.character(), is.logical() are TRUE for their respective datatypes only
  • is.factor(), is.ordered() are membership functions for factors with or without ordering
    • For more: nlevels(), levels()
  • lubridate
    • is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
    • is.POSIXt(), is.Date() are TRUE for their respective datatypes only
Table 2.1: (B10T01) Vector Classes
ii dd cc ll ff fo dtm dat
1 1 a FALSE odd odd 2021-12-17 22:26:05 2021-12-18
2 2 b TRUE even even 2021-12-17 22:26:06 2021-12-19
3 3 c FALSE odd odd 2021-12-17 22:26:07 2021-12-20
4 4 d TRUE even even 2021-12-17 22:26:08 2021-12-21
5 5 e FALSE odd odd 2021-12-17 22:26:09 2021-12-22
6 6 f TRUE even even 2021-12-17 22:26:10 2021-12-23

Basic Tibble

is

# #Validation
# #anyNA() is TRUE if there is an NA present, FALSE otherwise
vapply(bb, anyNA, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
vapply(bb, is.atomic, logical(1))
##   ii   dd   cc   ll   ff   fo  dtm  dat 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#
# #is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
# #CAUTION: With vapply() it returns TRUE for matrix (it checks individual elements)
# #CAUTION: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
vapply(bb, is.vector, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
#
# #is.numeric() is TRUE for both integer and double
vapply(bb, is.numeric, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.integer() is TRUE only for integer
vapply(bb, is.integer, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.double() is TRUE only for double
vapply(bb, is.double, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
#
# #is.character() is TRUE only for character
vapply(bb, is.character, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#
# #is.logical() is TRUE only for logical
vapply(bb, is.logical, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Factor

# #Factors
# #is.factor() is TRUE only for factor (unordered or ordered)
vapply(bb, is.factor, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#
# #is.ordered() is TRUE only for ordered factor
vapply(bb, is.ordered, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#
# #nlevels()
vapply(bb, nlevels, integer(1))
##  ii  dd  cc  ll  ff  fo dtm dat 
##   0   0   0   0   2   2   0   0
#
# #levels()
vapply(bb, function(x) !is.null(levels(x)), logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#
# #table()
table(bb$ff)
## 
## even  odd 
##    3    3

lubridate::is

# #Package lubridate covers the missing functions for POSIXct, POSIXlt, or Date 
# #is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
vapply(bb, is.timepoint, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#
# #is.POSIXt() is TRUE only for POSIXct 
vapply(bb, is.POSIXt, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#
# #is.Date() is only TRUE for DATE 
vapply(bb, is.Date, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Duplicates

# #Which Columns have Duplicate Values
vapply(bb, function(x) anyDuplicated(x) != 0L, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

2.9 Factors

Definition 2.3 Factors are the data objects which are used to categorize the data and store it as levels.

They can store both strings and integers. They are useful for columns that have a limited number of unique values, like “Male, Female” and “True, False” etc. They are useful in data analysis for statistical modelling.

A factor is essentially an integer representation of a character vector, with the distinct values stored as levels.

as.factor() vs. factor()

  • as.factor() is faster than factor() when input is a factor or integer
  • as.factor retains unused or NA levels whereas factor drops them
    • levels can also be dropped using droplevels()
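A small sketch of level dropping (toy factor, not the datasets above):

```r
# #Subsetting a factor keeps unused levels; factor()/droplevels() drop them
f <- factor(c("a", "b", "c"))
g <- f[f != "c"]
levels(g)
## [1] "a" "b" "c"
levels(factor(g))
## [1] "a" "b"
levels(droplevels(g))
## [1] "a" "b"
```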

Transformation

str(bb$ll)
##  logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
# #Coercion to Factor
bb$new <- as.factor(bb$ll)
str(bb$new)
##  Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
#
# #table()
table(bb$ll)
## 
## FALSE  TRUE 
##     3     3
table(bb$new)
## 
## FALSE  TRUE 
##     3     3
#
# #Levels can be Labelled differently also
str(bb$ff)
##  Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
# # 
str(factor(bb$ff, levels = c("even", "odd"), labels = c("day", "night")))
##  Factor w/ 2 levels "day","night": 2 1 2 1 2 1
str(factor(bb$ff, levels = c("odd", "even"), labels = c("day", "night")))
##  Factor w/ 2 levels "day","night": 1 2 1 2 1 2
#
# #Coercion from Factor to character, logical etc.
bb$xcc <- as.character(bb$new)
bb$xll <- as.logical(bb$new)
#
str(bb)
## tibble [6 x 11] (S3: tbl_df/tbl/data.frame)
##  $ ii : int [1:6] 1 2 3 4 5 6
##  $ dd : num [1:6] 1 2 3 4 5 6
##  $ cc : chr [1:6] "a" "b" "c" "d" ...
##  $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
##  $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
##  $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
##  $ dtm: POSIXct[1:6], format: "2021-12-17 22:26:05" "2021-12-17 22:26:06" "2021-12-17 22:26:07" ...
##  $ dat: Date[1:6], format: "2021-12-18" "2021-12-19" "2021-12-20" ...
##  $ new: Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
##  $ xcc: chr [1:6] "FALSE" "TRUE" "FALSE" "TRUE" ...
##  $ xll: logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE

Flights

bb <- xxflights
aa <- c("month", "day")
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
##  $ month: num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day  : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
# #To factor
bb$day <- as.factor(bb$day)
bb$month <- as.factor(bb$month)
# #Equivalent
#bb[aa] <- lapply(bb[aa], as.factor)
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
##  $ month: Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day  : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...

2.10 Lists

Definition 2.4 Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.

Caution: The only thing you need to take care of is that you do not give two elements the same name. R will NOT throw an ERROR.

Definition 2.5 Data Frames are lists with restriction that all elements of a data frame are of equal length.

Due to the resulting two-dimensional structure, data frames can mimic some of the behaviour of matrices. You can select rows and do operations on rows. You can not do that with lists, as a row is undefined there.

A Dataframe is intended to be used as a relational table. This means that elements in the same column are related to each other in the sense that they are all measures of the same metric. And, elements in the same row are related to each other in the sense that they are all measures from the same observation or measures of the same item. This is why, when you look at the structure of a Dataframe, it states the number of observations and the number of variables instead of the number of rows and columns.

Dataframes are distinct from Matrices because they can include heterogeneous data types among columns/variables. Dataframes do not permit multiple data types within a column/variable, for reasons that also follow from the relational-table idea.

All this implies that you should use a data frame for any dataset that fits in that two-dimensional structure. Essentially, you use data frames for any dataset where a column coincides with a variable and a row coincides with a single observation in the broad sense of the word. For all other structures, lists are the way to go.

  • Does everything in R have (exactly one) class
    • Everything has (at least one) class. Objects can have multiple classes
    • It is mostly just the class attribute of an object. But when the class attribute is not set, the class() function makes up a class from the object ‘type’ and the ‘dim’ attribute.
    • lists and dataframes have same typeof ‘list’ but different class
  • Then what does typeof() tell us
    • It tells us the internal ‘storage mode’ of an object, i.e. how R stores the object and interacts with it.
    • An object has one and only one mode (see: Difference between mode and class)
    • class is an attribute and thus can be defined/overridden by a user; however, mode (i.e. typeof) cannot be
  • To define an object, what should be known about it
    • class(), typeof(), is(), attributes(), str(), inherits(), …

list

# #CAUTION: Do not Create a list with duplicate names (R will NOT throw ERROR)
bb <- list(a=1, b=2, a=3)
# # 3rd index can not be accessed using $
bb$a
## [1] 1
identical(bb$a, bb[[1]])
## [1] TRUE
identical(bb$a, bb[[3]])
## [1] FALSE
bb[[3]]
## [1] 3

class vs. typeof

# #Create a list
bb_lst <- list( a = c(1, 2), b = c('a', 'b', 'c'))
tryCatch(
# #Trying to create varying length of variables in dataframe like in list
  bb_dft <- data.frame(a = c(1,2), b = c('a', 'b', 'c')), 
  error = function(e) {
# #Print ERROR
    cat(paste0(e))
# #Double Arrow Assignment '<<-' to assign in parent environment
    bb_dft <<- data.frame(a = c(1, 2), b = c('a', 'b'))
    }
  )
## Error in data.frame(a = c(1, 2), b = c("a", "b", "c")): arguments imply differing number of rows: 2, 3
#
# #Both list and dataframe have same type() 
typeof(bb_lst)
## [1] "list"
typeof(bb_dft)
## [1] "list"
#
# #But, class() is different for list and dataframe
class(bb_lst)
## [1] "list"
class(bb_dft)
## [1] "data.frame"
#
str(bb_lst)
## List of 2
##  $ a: num [1:2] 1 2
##  $ b: chr [1:3] "a" "b" "c"
str(bb_dft)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: num  1 2
##  $ b: chr  "a" "b"
#
# #Although 'bb_lst_c' is a list but inside coercion takes place i.e. '9' is character
bb_lst_c <- list( a = c(8, 'x'), b = c('y', 9))
str(bb_lst_c[[2]][2])
##  chr "9"
#
# #Here, '9' is numeric, it is stored as list element so note the extra [[]]
bb_lst_l <- list( a = list(8, 'x'), b = list('y', 9))
str(bb_lst_l[[2]][[2]])
##  num 9

2.11 Matrix

# #Create a Matrix
bb_mat <- matrix(1:6, nrow = 2, ncol = 3)
print(bb_mat)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
str(bb_mat)
##  int [1:2, 1:3] 1 2 3 4 5 6
class(bb_mat)
## [1] "matrix" "array"
typeof(bb_mat)
## [1] "integer"
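Matrices are filled column-wise by default and indexed as [row, column]:

```r
bb_mat <- matrix(1:6, nrow = 2, ncol = 3)
# #Single element: [row, column]
bb_mat[1, 2]
## [1] 3
# #Whole column / whole row
bb_mat[ , 2]
## [1] 3 4
bb_mat[1, ]
## [1] 1 3 5
# #Fill row-wise instead with byrow = TRUE
matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)[1, ]
## [1] 1 2 3
```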

2.12 Merge

# #Basic Tibble
bb <- xxbasic10
str(bb)
## tibble [6 x 8] (S3: tbl_df/tbl/data.frame)
##  $ ii : int [1:6] 1 2 3 4 5 6
##  $ dd : num [1:6] 1 2 3 4 5 6
##  $ cc : chr [1:6] "a" "b" "c" "d" ...
##  $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
##  $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
##  $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
##  $ dtm: POSIXct[1:6], format: "2021-12-17 22:26:05" "2021-12-17 22:26:06" "2021-12-17 22:26:07" ...
##  $ dat: Date[1:6], format: "2021-12-18" "2021-12-19" "2021-12-20" ...
# #Split with 'cc' as common ID column
bb_a <- bb[1:3]
bb_b <- bb[3:ncol(bb)]
#
# #merge() using the common ID column 'cc'
bb_new <- merge(bb_a, bb_b, by = "cc")
bb_new
##   cc ii dd    ll   ff   fo                 dtm        dat
## 1  a  1  1 FALSE  odd  odd 2021-12-17 22:26:05 2021-12-18
## 2  b  2  2  TRUE even even 2021-12-17 22:26:06 2021-12-19
## 3  c  3  3 FALSE  odd  odd 2021-12-17 22:26:07 2021-12-20
## 4  d  4  4  TRUE even even 2021-12-17 22:26:08 2021-12-21
## 5  e  5  5 FALSE  odd  odd 2021-12-17 22:26:09 2021-12-22
## 6  f  6  6  TRUE even even 2021-12-17 22:26:10 2021-12-23
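merge() above performs an inner join on the matching keys; the ‘all’ arguments give the other join types. A sketch with toy frames (not the tibble above):

```r
df_a <- data.frame(id = c(1, 2, 3), x = c("p", "q", "r"))
df_b <- data.frame(id = c(2, 3, 4), y = c("s", "t", "u"))
# #Inner join (default): only ids present in both
nrow(merge(df_a, df_b, by = "id"))
## [1] 2
# #Left join: keep all rows of df_a, fill missing y with NA
nrow(merge(df_a, df_b, by = "id", all.x = TRUE))
## [1] 3
# #Full outer join: keep all ids from both
nrow(merge(df_a, df_b, by = "id", all = TRUE))
## [1] 4
```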

2.13 Sort

  • sort()
    • It sorts a vector in ascending order (by default)
  • rank()
    • rank() returns the rank of each element, i.e. its position in ascending order
    • The smallest number receives rank 1
    • If there are ties, it returns numeric, not integer, with tied ranks averaged (e.g. 2.5)
  • order()
    • order() returns the indices that would arrange the vector in ascending order, i.e. x[order(x)] equals sort(x)
  • dplyr::arrange()
    • arrange() orders the rows of a data frame by the values of selected columns.
    • NA are always sorted to the end, even when wrapped with desc().
ERROR 2.2 Error in arrange(bb, day) : could not find function "arrange"
  • Load the Package (dplyr etc.) having the function.
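The relation between sort(), order(), and rank() on a toy vector:

```r
v <- c(30, 10, 20, 10)
sort(v)
## [1] 10 10 20 30
# #order() gives the indices that would sort v
order(v)
## [1] 2 4 3 1
stopifnot(identical(v[order(v)], sort(v)))
# #rank(): ties share the average rank, hence numeric output
rank(v)
## [1] 4.0 1.5 3.0 1.5
```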

order

bb <- xxflights
# #Sort ascending (default)
bb_1 <- bb[order(bb$dep_delay), ]
# #Sort descending
bb_2 <- bb[order(-bb$dep_delay), ]
#
bb[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1      517         2 N14228  UA     
## 2      533         4 N24211  UA     
## 3      542         2 N619AA  AA     
## 4      544        -1 N804JB  B6     
## 5      554        -6 N668DN  DL
bb_1[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1     2040       -43 N592JB  B6     
## 2     2022       -33 N612DL  DL     
## 3     1408       -32 N825AS  EV     
## 4     1900       -30 N934DL  DL     
## 5     1703       -27 N208FR  F9
bb_2[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1      641      1301 N384HA  HA     
## 2     1432      1137 N504MQ  MQ     
## 3     1121      1126 N517MQ  MQ     
## 4     1139      1014 N338AA  AA     
## 5      845      1005 N665MQ  MQ

Multi Column

bb <- xxbasic10
bb
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     1     1 a     FALSE odd   odd   2021-12-17 22:26:05 2021-12-18
## 2     2     2 b     TRUE  even  even  2021-12-17 22:26:06 2021-12-19
## 3     3     3 c     FALSE odd   odd   2021-12-17 22:26:07 2021-12-20
## 4     4     4 d     TRUE  even  even  2021-12-17 22:26:08 2021-12-21
## 5     5     5 e     FALSE odd   odd   2021-12-17 22:26:09 2021-12-22
## 6     6     6 f     TRUE  even  even  2021-12-17 22:26:10 2021-12-23
# #Sort ascending (default)
(bb_1 <- bb[order(bb$ll), ])
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     1     1 a     FALSE odd   odd   2021-12-17 22:26:05 2021-12-18
## 2     3     3 c     FALSE odd   odd   2021-12-17 22:26:07 2021-12-20
## 3     5     5 e     FALSE odd   odd   2021-12-17 22:26:09 2021-12-22
## 4     2     2 b     TRUE  even  even  2021-12-17 22:26:06 2021-12-19
## 5     4     4 d     TRUE  even  even  2021-12-17 22:26:08 2021-12-21
## 6     6     6 f     TRUE  even  even  2021-12-17 22:26:10 2021-12-23
# #Sort on Multiple Columns with ascending and descending
(bb_2 <- bb[order(bb$ll, -bb$dd), ])
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     5     5 e     FALSE odd   odd   2021-12-17 22:26:09 2021-12-22
## 2     3     3 c     FALSE odd   odd   2021-12-17 22:26:07 2021-12-20
## 3     1     1 a     FALSE odd   odd   2021-12-17 22:26:05 2021-12-18
## 4     6     6 f     TRUE  even  even  2021-12-17 22:26:10 2021-12-23
## 5     4     4 d     TRUE  even  even  2021-12-17 22:26:08 2021-12-21
## 6     2     2 b     TRUE  even  even  2021-12-17 22:26:06 2021-12-19
#
stopifnot(identical(bb_2, arrange(bb, ll, -dd)))

Validation


3 Data Manipulation (B11, Sep-12)

3.1 Overview

3.2 Get Help

# #To get the Help files on any Topic including 'loaded' Packages
?dplyr
?mutate
# #To get the Help files on any Topic including functions from 'not loaded' Packages
?dplyr::mutate()
# #Operators need Backticks i.e. ` . In keyboards it is located below 'Esc' Key
?`:`

3.3 Logical Operators and Functions

  • “|”   (Or, binary, vectorized)
  • “||”  (Or, binary, not vectorized)
  • “&”   (And, binary, vectorized)
  • “&&”  (And, binary, not vectorized)
  • Functions - any(), all()

Overview

  • Vectorised forms are “&” “|”
    • Thus, these compare vectors elementwise and operate over complete vector length.
    • NA is a valid logical object. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous.
    • All components of x or y are evaluated
    • (recycling) Elements are recycled if vector lengths are different
    • These are NOT recommended for use inside if() clauses
    • These are generally used for filtering
    • In R, & and | are elementwise (pairwise) logical operators, unlike in Python and C where & and | are bitwise operators
  • Non-vectorised forms are “&&” “||”
    • These examine only the first element of each vector
      • Caution: For these, vector length should always be 1
      • Use all() and any() to reduce the length to one
    • (short-circuit) These stop execution as soon as these find at least one definite condition i.e. TRUE for ||, FALSE for &&.
      • They will not evaluate the second operand if the first operand is enough to determine the value of the expression.
    • These are preferred in if() clauses
    • In R, && and || operate on single elements with short-circuiting, analogous to && and || in C (and ‘and’/‘or’ in Python)
  • all() and any()
    • all() : Are All Values TRUE
      • TRUE for 0-length vector
    • any() : Is at least one of the values TRUE
      • FALSE for 0-length vector
    • The value is a logical vector of length one being TRUE, FALSE, or NA.

Operators

# #At least one TRUE is present
NA | TRUE
## [1] TRUE
# #Depending upon what the unknown is, the outcome will change
NA | FALSE
## [1] NA
# #Depending upon what the unknown is, the outcome will change
NA & TRUE
## [1] NA
# #At least one FALSE is present
NA & FALSE 
## [1] FALSE
#
# #For length 1 vectors, output of vectorised and non-vectorised forms is same
stopifnot(all(identical(NA || TRUE, NA | TRUE), identical(NA || FALSE, NA | FALSE),
              identical(NA && TRUE, NA & TRUE), identical(NA && FALSE, NA & FALSE)))
#
# #But for vectors of >1 length, output is different
x <- 1:5
y <- 5:1
(x > 2) & (y < 3)
## [1] FALSE FALSE FALSE  TRUE  TRUE
(x > 2) && (y < 3)
## [1] FALSE
#
# # '&&' evaluates only the first element of a vector, thus caution is advised
TRUE & c(TRUE, FALSE)
## [1]  TRUE FALSE
TRUE & c(FALSE, FALSE)
## [1] FALSE FALSE
TRUE && c(TRUE, FALSE)
## [1] TRUE
TRUE && c(FALSE, FALSE)
## [1] FALSE
TRUE && all(c(TRUE, FALSE))
## [1] FALSE
TRUE && any(c(TRUE, FALSE))
## [1] TRUE

Evaluation

if(exists("x")) rm(x)
exists("x")
## [1] FALSE
#
# #No short-circuit for '|' or '&': the right operand is evaluated and throws an Error
tryCatch( TRUE | x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
tryCatch( FALSE & x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
#
# #Does not evaluate Right input because outcome already determined
tryCatch( TRUE || x, error = function(e) cat(paste0(e)))
## [1] TRUE
tryCatch( FALSE && x, error = function(e) cat(paste0(e)))
## [1] FALSE
# #Evaluates the right input because the outcome cannot be determined, and throws an error
tryCatch( TRUE && x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found

AnyAll

# #any()
any(NA, TRUE)
## [1] TRUE
any(NA, FALSE)
## [1] NA
any(NA, TRUE, na.rm = TRUE)
## [1] TRUE
any(NA, FALSE, na.rm = TRUE)
## [1] FALSE
any(character(0))
## [1] FALSE
#
# #all()
all(NA, TRUE)
## [1] NA
all(NA, FALSE)
## [1] FALSE
all(NA, TRUE, na.rm = TRUE)
## [1] TRUE
all(NA, FALSE, na.rm = TRUE)
## [1] FALSE
all(character(0))
## [1] TRUE

3.4 Relational Operators

\(>\) , \(<\) , \(==\) , \(>=\) , \(<=\) , \(!=\)
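These are vectorised and, like the logical operators, propagate NA. A small sketch:

```r
c(1, 2, 3) >= 2
## [1] FALSE  TRUE  TRUE
# #Comparison with NA yields NA, so use is.na() to test for NA
NA == NA
## [1] NA
is.na(NA)
## [1] TRUE
# #%in% never returns NA, which is convenient for filtering
2 %in% c(1, 2, NA)
## [1] TRUE
NA %in% c(1, 2)
## [1] FALSE
```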

3.5 Filter

  • dplyr::filter()
  • subset() vs. filter() -
    • Caution: R Manual itself warns against usage of subset(). It is better to use [] for subsetting
    • Caution: NOT Verified Yet
      • subset works on matrices, however, filter does not
      • subset does not work on databases, filter does
      • subset does not drop the rownames, however, filter removes them
      • filter preserves the class of the column, subset does not
      • filter works with grouped data, subset ignores them
    • filter is stricter and thus leads to fewer unexpected outcomes
ERROR 3.1 Error in match.arg(method) : object ’day’ not found
  • When the ‘dplyr’ package is not loaded, stats::filter() (a time-series function, attached by default) is called instead and throws this error.
  • Either load the package (dplyr etc.) or use scope resolution ‘::’

Basics

# #dplyr::filter() - Filter Rows based on Multiple Columns
bb_1 <- filter(bb, month == 1, day == 1)
dim(bb_1)
## [1] 842  19
# #Filtering by multiple criteria within a single logical expression
stopifnot(identical(bb_1, filter(bb, month == 1 & day == 1)))
#
if(anyNA(bb_1)) {
  bb_na <- na.omit(bb_1)
  print(paste0("Note: ", length(attributes(bb_na)$na.action), " rows removed."))
} else {
  print("NA not found")
}
## [1] "Note: 11 rows removed."
dim(bb_na)
## [1] 831  19

Conditional

dim(bb)
## [1] 336776     19
#
# #Flights in either months of November or December
dim(bb_2 <- filter(bb, month == 11 | month == 12))
## [1] 55403    19
#
# #Flights with arrival delay '<= 120' or departure delay '<= 120' 
# #It excludes flights where arrival & departure BOTH are delayed by >2 hours
# #If either delay is less than 2 hours, the flight is included
dim(bb_3 <- filter(bb, arr_delay <= 120 | dep_delay <= 120))
## [1] 320060     19
dim(bb_4 <- filter(bb, !(arr_delay > 120 & dep_delay > 120)))
## [1] 320060     19
dim(bb_5 <- filter(bb, (!arr_delay > 120 | !dep_delay > 120)))
## [1] 320060     19
#
# #Destination to IAH or HOU
dim(bb_6 <- filter(bb, dest == "IAH" | dest == "HOU"))
## [1] 9313   19
dim(bb_7 <- filter(bb, dest %in% c("IAH", "HOU")))
## [1] 9313   19
#
# #Carrier being "UA", "US", "DL"
dim(bb_8 <- filter(bb, carrier == "UA" | carrier == "US" | carrier == "DL"))
## [1] 127311     19
dim(bb_9 <- filter(bb, carrier %in% c("UA", "US", "DL")))
## [1] 127311     19
#
# #Did not leave late (before /on time departure) but Arrived late by >2 hours
dim(bb_10 <- filter(bb, (arr_delay > 120) & !(dep_delay > 0)))
## [1] 29 19
# 
# #Departed between midnight and 6 AM (inclusive)
dim(bb_11 <- filter(bb, (sched_dep_time >= 00 & sched_dep_time <= 600)))
## [1] 8970   19

subset()

# #subset() - Recommendation is against its usage. Use either '[]' or filter()
dim(bb_12 <- subset(bb, month == 1 | !(dep_delay >= 120), 
                    select = c("flight", "arr_delay")))
## [1] 319760      2
dim(bb_13 <- subset(bb, month == 1 | !(dep_delay >= 120) | carrier == "DL", 
                select = c("flight", "arr_delay")))
## [1] 321139      2

3.6 Subsetting

\([ \ \ ]\) , \([[ \ \ ]]\) , \(\$\)

  • Extract or Replace Parts of an Object
    • Operators acting on vectors, matrices, arrays and lists to extract or replace parts.
    • The most important distinction between “[ ],” “[[ ]]” and “$” is that the “[ ]” can select more than one element whereas the other two select a single element.
    • “$” does not allow computed indices, whereas “[[ ]]” does.
    • Subsetting (except by an empty index) will drop all attributes except names, dim and dimnames. Indexing will keep them.
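A minimal sketch of these distinctions on a small list:

```r
ll <- list(alpha = 1:3, beta = "b")
# #'[' can select more than one element and returns a list
length(ll[c("alpha", "beta")])
## [1] 2
# #'[[' returns the element itself
ll[["alpha"]]
## [1] 1 2 3
# #'[[' allows computed indices, '$' does not
nm <- "alpha"
identical(ll[[nm]], ll$alpha)
## [1] TRUE
is.null(ll$nm)  # '$' looks for an element literally named 'nm'
## [1] TRUE
```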
ERROR 3.2 Error in day == 1 : comparison (1) is possible only for atomic and list types
  • It occurs when the data is not available i.e. the column name is NOT found
  • It might happen when the original code assumed that the dataframe was attached
  • Either attach the dataframe (NOT Recommended) or use “$” to access the columns

dplyr::select()

  • It can use the range operator “:”, negation “!”, and the operators “&” (and) and “|” (or)
  • Selection Helpers
    • everything(): Matches all variables.
    • last_col(): Select last variable, possibly with an offset.
  • These helpers select variables by matching patterns in their names:
    • starts_with(): Starts with a prefix.
    • ends_with(): Ends with a suffix.
    • contains(): Contains a literal string.
    • matches(): Matches a regular expression.
    • num_range(): Matches a numerical range like x01, x02, x03.
  • These helpers select variables from a character vector:
    • all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
    • any_of(): Same as all_of(), except that no error is thrown for names that do not exist.
  • This helper selects variables with a function:
    • where(): Applies a function to all variables and selects those for which the function returns TRUE.
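A short sketch of a few of these helpers on the built-in mtcars dataset (assumes dplyr is attached):

```r
library(dplyr)
names(select(mtcars, starts_with("d")))
## [1] "disp" "drat"
# #any_of() tolerates names that do not exist; all_of() would throw an error
names(select(mtcars, any_of(c("mpg", "not_a_column"))))
## [1] "mpg"
# #where() selects by predicate; all 11 mtcars columns are numeric
ncol(select(mtcars, where(is.numeric)))
## [1] 11
```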

Cols

dim(bb)
## [1] 336776     19
#
# #Subset Consecutive Columns using Colon
stopifnot(identical(bb[ , 2:5], bb[ , -c(1, 6:ncol(bb))]))
#
# #dplyr::select()
bb_14 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_15 <- bb %>% select(year:day, arr_delay, dep_delay, distance, air_time)
stopifnot(identical(bb_14, bb_15))

Rows

dim(bb)
## [1] 336776     19
#
# #Subset Rows
dim(bb[which(bb$day == 1 & !(bb$month ==1)), ])
## [1] 10194    19
dim(bb[which(bb$day == 1 | bb$month ==1), ])
## [1] 37198    19
dim(bb[which(bb$day == 1 & bb$month ==1), ])
## [1] 842  19
# #Caution: here the comma passes the second condition as which()'s 'arr.ind'
# #argument, so it is silently ignored (the result equals day == 1 alone)
dim(bb[which(bb$day == 1, bb$month ==1), ])
## [1] 11036    19
dim(bb[which(bb$day == 1 & !(bb$carrier == "DL")), ])
## [1] 9482   19
dim(bb[which(bb$day == 1 | bb$carrier == "DL"), ])
## [1] 57592    19
dim(bb[which(bb$day == 1 & bb$carrier == "DL"), ])
## [1] 1554   19
# #Caution: again the second condition becomes 'arr.ind' and is ignored
dim(bb[which(bb$day == 1, bb$carrier == "DL"), ])
## [1] 11036    19

3.7 Grouped Summary

  • dplyr::summarise() or dplyr::summarize()
  • dplyr::group_by()
    • It converts an existing Tibble into a grouped Tibble where operations are performed “by group.”
    • ungroup() removes grouping.
    • n() gives the number of observations in the current group.

Summarise

bb <- xxflights
# #dplyr::summarise() & dplyr::summarize() are same
# #Get the mean of a column with NA excluded
#
summarize(bb, delay_mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay_mean
##        <dbl>
## 1       12.6
#
# #base::summary()
summary(bb$dep_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -43.00   -5.00   -2.00   12.64   11.00 1301.00    8255
#
# #Grouped Summary
by_ymd <- group_by(bb, year, month, day)
mysum <- summarize(by_ymd, 
                   dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
                   arr_delay_mean = mean(arr_delay, na.rm = TRUE),
                   .groups = "keep")
# #Equivalent 
bb %>% 
  group_by(year, month, day) %>% 
  summarize(dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE),
            .groups= "keep")
## # A tibble: 365 x 5
## # Groups:   year, month, day [365]
##     year month   day dep_delay_mean arr_delay_mean
##    <dbl> <dbl> <dbl>          <dbl>          <dbl>
##  1  2013     1     1          11.5          12.7  
##  2  2013     1     2          13.9          12.7  
##  3  2013     1     3          11.0           5.73 
##  4  2013     1     4           8.95         -1.93 
##  5  2013     1     5           5.73         -1.53 
##  6  2013     1     6           7.15          4.24 
##  7  2013     1     7           5.42         -4.95 
##  8  2013     1     8           2.55         -3.23 
##  9  2013     1     9           2.28         -0.264
## 10  2013     1    10           2.84         -5.90 
## # ... with 355 more rows

group_by()

# #Get delay grouped by distance 'Distance between airports, in miles.'
summary(bb$distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      17     502     872    1040    1389    4983
#
# #How many unique values are present in this numeric data i.e. factors
str(as.factor(bb$distance))
##  Factor w/ 214 levels "17","80","94",..: 163 165 145 171 106 96 138 22 120 99 ...
str(sort(unique(bb$distance)))
##  num [1:214] 17 80 94 96 116 143 160 169 173 184 ...
bb %>% 
  group_by(distance) %>% 
  summarize(count = n(),
            dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE),
            .groups= "keep")
## # A tibble: 214 x 4
## # Groups:   distance [214]
##    distance count dep_delay_mean arr_delay_mean
##       <dbl> <int>          <dbl>          <dbl>
##  1       17     1         NaN           NaN    
##  2       80    49          18.9          16.5  
##  3       94   976          17.5          12.7  
##  4       96   607           3.19          5.78 
##  5      116   443          17.7           7.05 
##  6      143   439          23.6          14.4  
##  7      160   376          21.8          16.2  
##  8      169   545          18.5          15.1  
##  9      173   221           7.05         -0.286
## 10      184  5504           3.07          0.123
## # ... with 204 more rows
#
# #For distance =17, there is only 1 flight and that too has NA, so the mean is NaN
bb[bb$distance == 17, ]
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##   <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl> <chr>  
## 1  2013     7    27       NA            106        NA       NA            245        NA US     
## # ... with 9 more variables: flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#
# #In general, Flight to any destination (ex: ABQ) has travelled same distance (1826)
unique(bb %>% filter(dest == "ABQ") %>% select(distance))
## # A tibble: 1 x 1
##   distance
##      <dbl>
## 1     1826
#
# #Mean Delays for Destinations with more than 1000 miles distance
bb %>% 
  group_by(dest) %>% 
  filter(distance > 1000) %>% 
  summarize(count = n(), 
            distance_mean = mean(distance, na.rm = TRUE),
            dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE))
## # A tibble: 48 x 5
##    dest  count distance_mean dep_delay_mean arr_delay_mean
##    <chr> <int>         <dbl>          <dbl>          <dbl>
##  1 ABQ     254         1826           13.7           4.38 
##  2 ANC       8         3370           12.9          -2.5  
##  3 AUS    2439         1514.          13.0           6.02 
##  4 BQN     896         1579.          12.4           8.25 
##  5 BUR     371         2465           13.5           8.18 
##  6 BZN      36         1882           11.5           7.6  
##  7 DEN    7266         1615.          15.2           8.61 
##  8 DFW    8738         1383.           8.68          0.322
##  9 DSM     569         1021.          26.2          19.0  
## 10 EGE     213         1736.          15.5           6.30 
## # ... with 38 more rows

3.8 Mutate

  • dplyr::mutate()
    • Newly created variables are available immediately
    • New variables overwrite existing variables of the same name.
    • Variables can be removed by setting their value to NULL.
    • mutate() adds new variables and preserves existing ones
      • mutate() can also keep or drop column according to the .keep argument.
    • transmute() adds new variables and drops existing ones.
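A sketch of the NULL-removal and .keep behaviour on mtcars (assumes dplyr >= 1.0.0; the kpl conversion factor is only illustrative):

```r
library(dplyr)
# #New variable 'kpl' is available immediately (used by 'kpl2');
# #'carb' is removed by setting it to NULL
m1 <- mutate(mtcars, kpl = mpg * 0.425, kpl2 = kpl^2, carb = NULL)
c("kpl" %in% names(m1), "carb" %in% names(m1))
## [1]  TRUE FALSE
# #.keep = "used" retains only the columns referenced plus the new ones
names(mutate(mtcars, kpl = mpg * 0.425, .keep = "used"))
## [1] "mpg" "kpl"
```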
ERROR 3.3 Error in UseMethod("select") : no applicable method for ’select’ applied to an object of class "function"
  • Run ‘str(MyObject)’ to check if ‘MyObject’ exists, looks as expected and R is not finding something else.
  • Most probably the name ‘data’ resolved to the function utils::data() because the actual data object was never created.
  • To minimise this type of error, do not use names that clash with base R functions e.g. ‘data’ (a function in utils) or ‘df’ (the F distribution density in stats)
ERROR 3.4 Error: Problem with mutate() column ... column object ’arr_delay’ not found
  • Run ‘str(MyObject)’ to check if the column exists in the dataset
  • Caution: if the dataset was attached earlier, then R will NOT throw this error. However, later when the code is being executed in a clean environment, it will fail. To avoid this, it is recommended to use proper scope resolution and to avoid attaching the dataset (if possible)
dim(bb)
## [1] 336776     19
#
bb_16 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_17 <- mutate(bb_16,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
# #Equivalent
bb %>% 
  select(year:day, arr_delay, dep_delay, distance, air_time) %>% 
  mutate(gain = arr_delay - dep_delay,
         speed = distance / air_time * 60,
         hours = air_time / 60,
         gain_per_hour = gain / hours)
## # A tibble: 336,776 x 11
##     year month   day arr_delay dep_delay distance air_time  gain speed hours gain_per_hour
##    <dbl> <dbl> <dbl>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>         <dbl>
##  1  2013     1     1        11         2     1400      227     9  370. 3.78           2.38
##  2  2013     1     1        20         4     1416      227    16  374. 3.78           4.23
##  3  2013     1     1        33         2     1089      160    31  408. 2.67          11.6 
##  4  2013     1     1       -18        -1     1576      183   -17  517. 3.05          -5.57
##  5  2013     1     1       -25        -6      762      116   -19  394. 1.93          -9.83
##  6  2013     1     1        12        -4      719      150    16  288. 2.5            6.4 
##  7  2013     1     1        19        -5     1065      158    24  404. 2.63           9.11
##  8  2013     1     1       -14        -3      229       53   -11  259. 0.883        -12.5 
##  9  2013     1     1        -8        -3      944      140    -5  405. 2.33          -2.14
## 10  2013     1     1         8        -2      733      138    10  319. 2.3            4.35
## # ... with 336,766 more rows

Validation


4 Statistics (B12, Sep-26)

4.2 Definitions

6.20 A population is the set of all elements of interest in a particular study.

6.23 The process of conducting a survey to collect data for the entire population is called a census.

6.21 A sample is a subset of the population.

12.7 A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.

4.3 Inferential Statistics

6.25 Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

Inferential statistics are used for Hypothesis Testing. Refer Statistical Inference

4.4 Hypothesis Testing

Refer Hypothesis Testing

14.1 Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

14.2 Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.

14.3 Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

Refer Steps of Hypothesis Testing

  1. State the NULL Hypothesis \({H_0}\)
    • The null will always be in the form of decisions regarding the population, not the sample.
      • If we have population data, we can do the census and then there is no requirement of any hypothesis or estimation.
    • The Null Hypothesis will always be written as the absence of some parameter or process characteristic
      • The test is designed to assess the strength of the evidence against the null hypothesis.
      • Often the null hypothesis is a statement of “no difference.”
    • The equality part of the expression always appears in \({H_0}\) i.e. it can be \(\geq\) , \(\leq\) , \(=\)
    • The term ‘null’ is used because this hypothesis assumes that there is no difference between the two means or that the recorded difference is not significant.
  2. An Alternative Hypothesis \({H_a}\), is then stated which will be the complement of the Null Hypothesis.
    • \({H_a}\) cannot contain the equality part of the expression i.e. it can be \(<\) , \(>\) , \(\neq\)
    • The claim about the population that evidence is being sought for is the alternative hypothesis
      • However, to establish it, we instead try to show that its complement (the null hypothesis) is false, because it is easier to disprove a statement than to prove it.
  3. For Hypothesis tests involving a population mean, let \({\mu}_0\) denote the hypothesized value

14.4 \(\text{\{Left Tail or Lower Tail\} } {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

14.5 \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

14.6 \(\text{\{Two Tail\} } {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

  • Sample data is used to determine whether you can be statistically confident in rejecting, or failing to reject, the \({H_0}\).
    • If the \({H_0}\) is rejected, the statistical conclusion is that the \({H_a}\) is TRUE.
  • Notes:
    • Sometimes it is easier to formulate the alternative hypothesis (the conclusion that you hope to support) and create NULL hypothesis based on that.
    • Ex: If we are testing for validity of the claim that number of defects are less than 2%
      • \({H_a} : {\mu} < 2\% \iff {H_0} : {\mu} \geq 2\%\)
      • If the \({H_0}\) is rejected, then the statistical conclusion is that the \({H_a}\) is TRUE i.e. defects are less than 2% in the population
      • If the \({H_0}\) is not rejected, then no conclusion can be formed about the \({H_a}\).

Question: Is there an ideal sample size?

  • NO
  • (“ForLater”) However, there exists a relationship between (I guess) alpha, beta and the sample size n. (I could not find the link on a later search.)
  • (Paraphrasing from memory, so it can be wrong!) Basically, for a given analysis, if we want to keep both types of errors to a manageable level, we can calculate the minimum number of samples that would let us determine the outcome at a certain minimum confidence level.

4.5 Point Estimation

12.9 To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.

12.10 A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation)

12.11 The numerical value obtained for the sample statistic is called the point estimate. ‘Estimate’ applies to the sample value only; the corresponding population value is called a parameter. An estimate is a value, while an estimator is a function.

Example: \({\overline{x}}\) is an estimator (of the population parameter ‘mean’ \({\mu}\)). Its estimate is 3, and this calculation process is an estimation.

4.6 Standard Deviation

8.6 Given a data set \({X=\{x_1,x_2,\ldots,x_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({x_1,x_2,\ldots,x_n}\) divided by the count \({n}\).

Refer Standard Deviation and equation (8.12)

8.12 The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.

\[\begin{align} \sigma &= \sqrt{\frac{1}{n} \sum_{i=1}^n \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \bar{x}\right)^2} \end{align}\]

A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

4.7 Variance

Refer Variance and equation (8.11)

8.11 The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.

\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align}\]

Variability is most commonly measured with the Range, IQR, SD, and Variance.

4.8 Standard Error or Sampling Fluctuation

The sample we draw from the population is only one from a large number of potential samples.

  • If ten researchers were all studying the same population, each drawing their own sample, they may obtain different answers i.e. each of the ten researchers may come up with a different mean
  • Thus, the statistic in question (the mean) varies from sample to sample. It has a distribution called a sampling distribution.
  • We can use this distribution to understand the uncertainty in our estimate of the population parameter.

Refer Standard Error

12.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\).

12.14 A sampling error is the difference between a population parameter and a sample statistic.

Sampling fluctuation (Standard Error) refers to the extent to which a statistic (mean, median, mode, sd etc.) takes on different values with different samples i.e. it refers to how much the value of the statistic fluctuates from sample to sample.

12.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by equation (12.1) i.e. \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)

  • Generally, the standard error is unknown.
  • The higher the standard error, the higher the deviation from sample to sample, i.e. the lower the reliability.
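The \(\sigma_{\overline{x}} = \sigma/\sqrt{n}\) relationship can be checked with a quick simulation (a sketch; the seed, sample size, and number of replications are arbitrary):

```r
set.seed(1)
sigma <- 10
n <- 25
# #Draw many samples of size n and record each sample mean
means <- replicate(5000, mean(rnorm(n, mean = 50, sd = sigma)))
# #Empirical SD of the sample means should be close to the theoretical SE
sd(means)
sigma / sqrt(n)
## [1] 2
```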

4.9 Test Statistic

Refer Test Statistic

14.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

For hypothesis tests about a population mean in the \({\sigma}\) known case, we use the standard normal random variable \({z}\) as a test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. As given in equation (14.1) i.e. \(z = \frac{\overline{x} - \mu_0}{\sigma_{\overline{x}}} = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}\)
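Equation (14.1) as a minimal sketch; all the numbers here (xbar, mu0, sigma, n) are made up for illustration:

```r
xbar  <- 162.4  # sample mean
mu0   <- 160    # hypothesized population mean
sigma <- 8      # known population standard deviation
n     <- 25
z <- (xbar - mu0) / (sigma / sqrt(n))
z
## [1] 1.5
# #Two-tail p-value from the standard normal
2 * pnorm(-abs(z))
## [1] 0.1336144
```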

4.10 Calculate SD & SE

Standard Error (SE) is the same as ‘the standard deviation of the sampling distribution.’ The ‘variance of the sampling distribution’ is the variance of the data divided by the sample size n.

Calculate Statistics

# #DataSet: Height of 5 people in 'cm'
hh <- c(170.5, 161, 160, 170, 150.5)
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #Mean by mean()
hh_mean <- mean(hh)
cat("Mean = ", hh_mean)
## Mean =  162.4
#
# #Variance by var()
hh_var <- round(var(hh), 3)
cat("Variance = ", hh_var)
## Variance =  68.175
#
# #Standard Deviation (SD) by sd()
hh_sd <- round(sd(hh), 3)
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) =  8.257
#
# #Standard Error (SE) 
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) =  3.693

R Functions

# #DataSet: Height of 5 people in 'cm'
print(hh)
## [1] 170.5 161.0 160.0 170.0 150.5
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #sum by sum()
print(hh_sum <- sum(hh))
## [1] 812
#
# #Mean by mean()
hh_mean <- mean(hh)
hh_mean_cal <- hh_sum / hh_len
stopifnot(identical(hh_mean, hh_mean_cal))
cat("Mean = ", hh_mean)
## Mean =  162.4
#
# #Calculate the deviation from the mean by subtracting each value from the mean
print(hh_dev <- hh - hh_mean)
## [1]   8.1  -1.4  -2.4   7.6 -11.9
#
# #Square the deviation
print(hh_sqdev <- hh_dev^2)
## [1]  65.61   1.96   5.76  57.76 141.61
#
# #Get Sum of the squared deviations
print(hh_sqdev_sum <- sum(hh_sqdev))
## [1] 272.7
#
# #Divide it by the 'sample size (N) – 1' for the Variance or use var()
hh_var <- round(var(hh), 3)
hh_var_cal <- hh_sqdev_sum / (hh_len -1)
stopifnot(identical(hh_var, hh_var_cal))
cat("Variance = ", hh_var)
## Variance =  68.175
#
# #Variance of the sampling distribution 
hh_var_sample <- hh_var / hh_len
cat("Variance of the Sampling Distribution = ", hh_var_sample)
## Variance of the Sampling Distribution =  13.635
#
# #Take square root of the Variance for the Standard Deviation (SD) or use sd()
hh_sd_cal <- round(sqrt(hh_var), 3)
hh_sd <- sd(hh)
stopifnot(identical(round(hh_sd, 3), hh_sd_cal))
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) =  8.256815
#
# #Standard Error (SE)
# #SE
# #Divide the SD by the square root of the sample size for the Standard Error (SE)
# #
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
#
# #Calculate SE from Variance 
hh_se_var <- round(sqrt(hh_var_sample), 3)
stopifnot(identical(hh_se_sd, hh_se_var))
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) =  3.693

4.11 Histogram and Density

Using Dataset Flights : “air_time” -Amount of time spent in the air, in minutes. Refer figure 4.1

Graphs

Flights: Air Time (min) excluding NA (Histogram and Density)

Figure 4.1 Flights: Air Time (min) excluding NA (Histogram and Density)

NA

# #Remove All NA
aa <- na.omit(xxflights$air_time)
attr(aa, "na.action") <- NULL
str(aa)
##  num [1:327346] 227 227 160 183 116 150 158 53 140 138 ...
summary(aa)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    82.0   129.0   150.7   192.0   695.0

Stats

# #Overview of Data after removal of NA
bb <- aa
stopifnot(is.null(dim(bb)))
summary(bb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    82.0   129.0   150.7   192.0   695.0
# #min(), max(), range(), summary()
min_bb <- summary(bb)[1]
max_bb <- summary(bb)[6]
range_bb <- max_bb - min_bb
cat(paste0("Range = ", range_bb, " (", min_bb, ", ", max_bb, ")\n"))
## Range = 675 (20, 695)
# #IQR(), summary()
iqr_bb <- IQR(bb)
cat(paste0("IQR = ", iqr_bb, " (", summary(bb)[2], ", ", summary(bb)[5], ")\n"))
## IQR = 110 (82, 192)
# #median(), mean(), summary()[3], summary()[4] 
median_bb <- median(bb)
cat("Median =", median_bb, "\n")
## Median = 129
mu_mean_bb <- mean(bb)
cat("Mean \u03bc =", mu_mean_bb, "\n")
## Mean µ = 150.6865
#
sigma_sd_bb <- sd(bb)
cat("SD (sigma) \u03c3 =", sigma_sd_bb, "\n")
## SD (sigma) σ = 93.6883
#
variance_bb <- var(bb)
cat(sprintf('Variance (sigma)%s %s%s =', '\u00b2', '\u03c3', '\u00b2'), variance_bb, "\n")
## Variance (sigma)² σ² = 8777.498

Histogram

# #Histogram
bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = bb)
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B12P01 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
  geom_text(data = tibble(x = mean_hh, y = -Inf, 
                          label = paste0("Mean= ", mean_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
  geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
  geom_text(data = tibble(x = median_hh, y = -Inf, 
                          label = paste0("Median= ", median_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                         "; Median= ", median_hh, "; SD= ", sd_hh,
                         ")"), 
        caption = "B12P01", title = "Flights: Air Time")
}

Density

# #Density Curve
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]],.05)
q95_hh <- quantile(hh[[1]],.95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B12P02 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_density(alpha = 0.2, colour = "#21908CFF") + 
  geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  #scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
  scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) + 
  ggplot2::annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = xpoints_hh[2], y = 0.007, 
                    yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  ggplot2::annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = xpoints_hh[6], y = 0.007, 
                    yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.007, label = "95.4%") + 
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Density", 
       subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1), 
                         "; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
                         ")"), 
        caption = "B12P02", title = "Flights: Air Time")
}

Aside

  • This section is NOT useful for the general reader and can be safely ignored. It contains my notes related to building this book, useful only for someone building their own book. (Shivam)
  • Side by Side Images need a Caption in the Final Chunk
  • LaTeX inside TeX() cannot process braces as usual; avoid them or escape them

4.12 Effect of Sample Size and Repeat Sampling

Using Dataset Flights : “air_time” -Amount of time spent in the air, in minutes.

  1. Effect of increasing sample size (N =100, 1000, 10000), Refer figure 4.2
    • the precision and confidence in the estimate increases and uncertainty decreases
    • the distribution of sample means become thinner. i.e. the sample standard deviation decreases
  2. Effect of increasing the Sampling, Refer figure 4.4
    • The mean of the distribution of sample means equals the mean of the parent distribution.
    • Refer Standard Error

Caution: The trend here does not match the theory. However, the exercise shows the ‘how to do it’ part. It can be repeated with better data, a larger sample size, or more repeated sampling.

4.12.1 Sample Size

GIF

Effect of Increasing Sample Size

Figure 4.2 Effect of Increasing Sample Size

Images

Effect of Increasing Sample Size

Figure 4.3 Effect of Increasing Sample Size

Code

bb <- na.omit(xxflights$air_time)
# #Pseudo Random Number Generation by set.seed() 
set.seed(3)
# #Set Sample Size
#nn <- 100L
# #Take a sample from dataset
xb100 <- sample(bb, size = 100L)
xb1000 <- sample(bb, size = 1000L)
xb10000 <- sample(bb, size = 10000L)
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 100
hh <- tibble(ee = xb100)
ylim_hh <- 12.5
caption_hh <- "B12P03"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, caption_hh, ylim_hh
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"), 
                x = mean(.data[["ee"]]), y = -Inf), 
            color = '#440154FF', hjust = 2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) + 
  geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
  geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
            color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) + 
  coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) + 
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(Mean= ", round(mean(.[[1]]), 1), 
                         "; SD= ", round(sd(.[[1]]), 1),
                         #"; Var= ", round(var(.[[1]]), 1),
                         "; SE= ", round(sd(.[[1]]) / sqrt(nrow(.)), 1),
                         ")"), 
      caption = caption_hh, title = paste0("Sample Size = ", nrow(.)))
}
assign(caption_hh, B12)
rm(B12)

Warnings

  • “In mean.default(gg) : argument is not numeric or logical: returning NA”
    • For ggplot() - this comes up if the object ‘gg’ is NULL. Check whether ggplot is looking in the global scope instead of the local data frame that was passed.


4.12.2 Repeat Sampling

GIF


Figure 4.4 Effect of Increasing Sampling

Images


Figure 4.5 Effect of Increasing Sampling

Code

bb <- na.omit(xxflights$air_time)
# #Pseudo Random Number Generation by set.seed() 
set.seed(3)
# #Set Sample Size
nn <- 10L
# #Set Repeat Sampling Rate
rr <- 20L
# #Take Sample of N = 10, get mean, repeat i.e. get distribution of mean
xr20 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 200L
xr200 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 2000L
xr2000 <- replicate(rr, mean(sample(bb, size = nn)))
#
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 10, Repeat = 20
hh <- tibble(ee = xr20)
ylim_hh <- 2
caption_hh <- "B12P06"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, caption_hh, ylim_hh, nn
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  geom_text(aes(label = TeX(r'($E(\bar{x})$)', output = "character"), 
                x = mean(.data[["ee"]]), y = -Inf), 
            color = '#440154FF', hjust = 1.5, vjust = -1.5, parse = TRUE, check_overlap = TRUE) + 
  geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
  geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
            color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) + 
  coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) + 
  theme(plot.title.position = "panel") + 
  labs(x = TeX(r'($\bar{x} \, (\neq x)$)'), y = TeX(r'(Frequency of $\, \bar{x}$)'), 
       subtitle = TeX(sprintf(
         "($\\mu$=%.0f) $E(\\bar{x}) \\, =$%.0f; $\\sigma_{\\bar{x}} \\, =$%.0f",
                             mu_hh, round(mean(.[[1]]), 1), round(sd(.[[1]])))),
       caption = caption_hh, 
       title = paste0("Sampling Distribution (N = ", nn, ") & Repeat Sampling = ", nrow(.)))
}
assign(caption_hh, B12)
rm(B12)

4.13 Normal Distribution

Refer Normal Distribution and equation (11.2)

11.3 A normal distribution (\({\mathcal {N}}_{(\mu,\, \sigma^2)}\)) is a type of continuous probability distribution for a real-valued random variable.

Their importance is partly due to the Central Limit Theorem. The assumption of a normal distribution allows us to apply Parametric Methods.

23.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

12.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.

It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases.

Parametric statistical tests typically assume that samples come from normally distributed populations, but the central limit theorem means that this assumption is not necessary to meet when you have a large enough sample. A sample size of 30 or more is generally considered large.

This is the basis of Empirical Rule.

8.20 The empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are 68.27%, 95.45%, and 99.73%.
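The empirical rule percentages can be checked numerically with the standard normal CDF; a minimal sketch:

```r
# #Empirical Rule check via the standard normal CDF
# #pnorm(k) gives P(X <= k) for N(0, 1)
within_k <- function(k) pnorm(k) - pnorm(-k)
round(within_k(1) * 100, 2)
## [1] 68.27
round(within_k(2) * 100, 2)
## [1] 95.45
round(within_k(3) * 100, 2)
## [1] 99.73
```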

Caution: If data from small samples do not closely follow this pattern, then other distributions like the t-distribution may be more appropriate.

4.14 Standard Normal Distribution

Refer Standard Normal and equation (11.3)

11.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal {z}}_{({\mu} = 0,\, {\sigma} = 1)}\)

The simplest case of a normal distribution is known as the standard normal distribution. Given the Population with normal distribution \({\mathcal {N}}_{(\mu,\, \sigma)}\)

If \(\overline {X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \(\sigma/{\sqrt{n}}\) and thus the z-score is \(Z=\frac {\overline {X}-\mu }{\sigma/{\sqrt{n}}}\)
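A minimal sketch of this z-score calculation; the numbers (mu = 100, sigma = 15, n = 36, sample mean 105) are made up for illustration:

```r
# #Made-up numbers: Population N(mu = 100, sigma = 15), sample n = 36, xbar = 105
mu <- 100; sigma <- 15; n <- 36; x_bar <- 105
# #Standard Error
se <- sigma / sqrt(n)
se
## [1] 2.5
# #z-score of the sample mean
z <- (x_bar - mu) / se
z
## [1] 2
```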

The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.

The area under the curve to the right of a z-score is the p-value (for an upper tail test), and it is the likelihood of your observation occurring if the null hypothesis is true.

Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.

4.15 Outliers

Refer Outliers

8.21 Sometimes unusually large or unusually small values are called outliers. It is a data point that differs significantly from other observations.

  • Question: If we include a datapoint which is +4 standard deviations away, would we be able to get the Normal Distribution
    • The shape of the curve will be tilted, thus it will be difficult to keep the datapoint and satisfy the condition for normality
    • Generally, only \({{\mu}-3{\sigma} \leq {x} \leq {\mu}+3{\sigma}}\) values are kept and the remaining are treated as outliers
  • Question: Is it bad data if it is +4 standard deviations away
    • It means that if we keep the data point, there is a high possibility that we will violate the normality assumption. If we violate the assumption, parametric methods cannot be applied to the dataset
    • In general, convert to z-values and remove those which have a z-value higher than +3 or lower than -3
  • Question: But, how many removals are too many removals
    • There are techniques for this consideration; they will be covered later. “ForLater”
    • (Aside) In a sample of 1000 observations, the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected. If the sample size is only 100, however, just three such outliers are already reason for concern.
  • Concern: The frequency or proportion of outliers should not be very high
    • It is true that we cannot have a normal distribution in that case.
      • However, can we afford to remove all the data points with \(z > 3\)? This needs to be answered in the context of the analysis.
        • Here ‘assignable cause’ is applied, i.e. each datapoint that is proposed to be an outlier is individually analysed and either kept or removed
  • Concern: Sometimes the outliers are present because the dataset is a mixture of two distributions
    • In that case, those should be treated separately
  • Question: Are there tools for all of this jugglery
    • Yes, there are; in particular, nonparametric methods do not make any assumption about the distribution.
    • However, these are not as powerful as parametric tests, so if possible, stay with parametric tests
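The z-value screening described above can be sketched in R. The simulated data, the planted outlier at 8, and the |z| > 3 threshold are all assumptions for illustration; each flagged point should still be reviewed for an assignable cause before actually being dropped:

```r
# #Simulated data with one planted outlier at 8 (illustration only)
set.seed(1)
x <- c(rnorm(1000), 8)
# #Convert to z-values
z <- (x - mean(x)) / sd(x)
# #Screen on the assumed threshold |z| > 3; review each flagged
# #point for an assignable cause before actually dropping it
x_kept <- x[abs(z) <= 3]
```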

4.16 Type I and Type II Errors


Figure 4.6 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Example

  • Type-I “An innocent person is convicted”
  • Type-II “A guilty person is not convicted”

Since we are using sample data to make inferences about the population, it is possible that we will make an error. In the case of the Null Hypothesis, we can make one of two errors.

Refer Type I and Type II Errors

14.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).

14.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).

14.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

13.3 The confidence level expressed as a decimal value is the confidence coefficient (\(1-{\alpha}\)). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

14.23 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

There is always a tradeoff between Type-I and Type-II errors.

  • Generally max 5% \({\alpha}\) and max 20% \({\beta}\) errors are recommended

In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.

  • If the cost of making a Type I error is high, small values of \({\alpha}\) are preferred. Ex: \(\alpha =0.01\)
  • If the cost of making a Type I error is not too high, larger values of \({\alpha}\) are typically used. Ex: \(\alpha = 0.05\)

14.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).”

4.17 Critical Value


Figure 4.7 Left Tail vs. Right Tail


Figure 4.8 Two Tail

14.17 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).

# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.2816 1.6449 1.9600 2.3263 2.5758

14.15 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

14.18 An acceptance region (confidence interval) is a set of values for the test statistic for which the null hypothesis is accepted. i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.

14.20 A rejection region (critical region), is a set of values for the test statistic for which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.

4.18 Tailed Tests

14.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

One-tailed tests are concerned with one side of a statistic, whereas two-tailed tests deal with both tails of the distribution.

A two-tail test is done when you do not know the direction, so you test both sides.

14.4 \(\text{\{Left Tail or Lower Tail\} } {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

14.5 \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

14.6 \(\text{\{Two Tail\} } {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

4.19 Approaches

14.14 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.

Steps for the p-value approach or test statistic approach

  • Calculate \(z\) for the given \(\overline{x}\): \(z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}\)
  • Refer Calculate P(z) by pnorm(), to get the p-value from z-table
    • \(P_{\left(\overline{x}\right)} = P_{\left(z\right)}\)
  • Compare p-value with Level of significance \({\alpha}\)
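The steps above can be sketched with fictitious numbers (H0: mu <= 50, right tail; sample mean 52, sigma = 8, n = 64 are assumed for illustration):

```r
# #Fictitious values: H0: mu <= 50 (Right Tail), xbar = 52, sigma = 8, n = 64
mu0 <- 50; x_bar <- 52; sigma <- 8; n <- 64; alpha <- 0.05
# #Test statistic
z <- (x_bar - mu0) / (sigma / sqrt(n))
z
## [1] 2
# #p-value from the z-distribution (Right Tail)
p_value <- pnorm(z, lower.tail = FALSE)
round(p_value, 4)
## [1] 0.0228
# #Compare with the level of significance
p_value < alpha
## [1] TRUE
```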

14.16 The critical value approach requires that we first determine a value for the test statistic called the critical value.

Steps for the critical value approach

  • Calculate \(z\) for the given \(\overline{x}\): \(z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}\)
  • Using the z-table, find the z for given Level of significance \({\alpha} = 0.01\)
  • Compare test statistic with z-value i.e. \((z)\) vs. \((z_{\alpha = 0.01})\)
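The critical value steps can be sketched with fictitious numbers (sample mean 52, mu0 = 50, sigma = 8, n = 64 are assumed for illustration):

```r
# #Fictitious values: xbar = 52, mu0 = 50, sigma = 8, n = 64
mu0 <- 50; x_bar <- 52; sigma <- 8; n <- 64
z <- (x_bar - mu0) / (sigma / sqrt(n))
# #Critical value for alpha = 0.01 (Right Tail)
z_crit <- qnorm(0.01, lower.tail = FALSE)
round(z_crit, 4)
## [1] 2.3263
# #Compare test statistic with critical value
z > z_crit
## [1] FALSE
```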

4.20 z-test vs. t-test

If the population standard deviation \((\sigma)\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test with increasing sample size.

Question: Does the probability from the t-table differ from the probability value from the z-table

  • No; practically, for sample sizes greater than 30, there is no difference

It is assumed that \((\overline{x} - \mu)\) follows normality. However, the Standard Error (SE) does not: the sample variance behind it follows a (scaled) chi-squared distribution. Thus, \((\overline{x} - \mu)/SE\) becomes a ratio of a normal variable to the square root of a chi-squared variable, and this ratio follows the t-distribution. Thus, the test we apply is called the t-test.

# #For Degrees of Freedom = 10 (N=11)
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
dof <- 10L
#
# #Left Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.3722 -1.8125 -2.2281 -2.7638 -3.1693
#
# #Right Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.3722 1.8125 2.2281 2.7638 3.1693
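The convergence of the t-test to the z-test can be seen by letting the degrees of freedom grow:

```r
# #t critical values approach the z critical value as df grows
# #(alpha = 0.05, Right Tail)
z_crit <- qnorm(0.05, lower.tail = FALSE)
round(z_crit, 4)
## [1] 1.6449
t_crit <- sapply(c(10, 30, 100, 1000), function(df)
  qt(0.05, df = df, lower.tail = FALSE))
round(t_crit, 4)
## [1] 1.8125 1.6973 1.6602 1.6464
```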

4.21 t-test

4.21.1 Degrees of Freedom

13.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).

Why the degrees of freedom are \((n-1)\)

  • Degrees of freedom refer to the number of independent pieces of information that go into the computation. i.e. \(\{(x_{1}-\overline{x}), (x_{2}-\overline{x}), \ldots, (x_{n}-\overline{x})\}\)
  • However, \(\sum (x_{i}-\overline{x}) = 0\) for any data set.
  • Thus, only \((n − 1)\) of the \((x_{i}-\overline{x})\) values are independent.
    • if we know \((n − 1)\) of the values, the remaining value can be determined exactly by using the condition.
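The zero-sum constraint behind the \((n-1)\) can be verified directly; the data values are made up for illustration:

```r
# #Deviations from the sample mean always sum to zero
x <- c(3, 7, 8, 12, 20)
d <- x - mean(x)
sum(d)
## [1] 0
# #Knowing the first (n - 1) deviations fixes the last one
sum(d[1:4])
## [1] -10
-d[5]
## [1] -10
```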

Question: Is there any minimum sample size we must consider before calculating degrees of freedom

  • Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.

Guess: Degrees of freedom is also calculated to remove the possible bias

  • No

4.21.2 How to use t-table

  • Rows have degrees of freedom, Columns have \({\alpha}\) values, get the t-statistic at their intersection
    • For DOF = 10, and \({\alpha} = 0.05\), t-table has value 1.812 (Critical Limit)
    • In right tail test, if the test-statistic is greater than critical limit, we can reject the null
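The same lookup can be reproduced with qt(); the test statistic 2.5 is an assumed value for illustration:

```r
# #t-table lookup: DOF = 10, alpha = 0.05 (Right Tail)
t_crit <- qt(0.05, df = 10, lower.tail = FALSE)
round(t_crit, 3)
## [1] 1.812
# #Right tail test with an assumed test statistic of 2.5
2.5 > t_crit
## [1] TRUE
```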

Validation


5 Statistics (B13, Oct-03)

5.1 Overview

  • “Introduction to Statistics”

5.2 Definitions


Figure 5.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Refer Type I and Type II Errors (B12) & Type I and Type II Errors

14.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).

14.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).

14.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

13.3 The confidence level expressed as a decimal value is the confidence coefficient (\(1-{\alpha}\)). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

14.23 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

14.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

14.22 p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret

5.3 Approaches

Population Size = 100, \({\alpha} = 0.05\)

Hypothesis: \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq 22 \iff {H_a}: {\mu} > 22\)

Sample: n=4, dof = 3, \({\overline{x}} = 23\)

Sample: n=10, dof = 9, \({\overline{x}} = 23\)

We know that if we take another sample, we will have a different sample mean. So, we need to confirm whether the above calculated sample mean (\({\overline{x}} = 23\)) represents the population mean (\({\mu}\)), i.e. can we reject or fail to reject \({H_0}\) based on this sample!

3 Approaches for Hypothesis Testing -

  1. Test Statistic Approach
    • Fictitious values: Standard Error (SE) = 0.22, so \(t = \frac{23 - 22}{0.22} = 4.545\)
    • For (DOF = 3): \(P_{(t)} = {\alpha} = 0.05\), at \({}^{3}t_{\alpha} = 2.353\)
    • For (DOF = 9): \(P_{(t)} = {\alpha} = 0.05\), at \({}^{9}t_{\alpha} = 1.833\)
    • For both the cases, \({t}\) is greater than \({}^{dof}t_{\alpha}\)
    • Hence null is rejected, the ‘test is statistically significant’
  2. p-value approach
    • Fictitious values: Standard Error (SE) = 0.22, so \(t = \frac{23 - 22}{0.22} = 4.545\)
    • Get \({}^3\!P_{(t = 4.545)} = 0.00997\)
    • Get \({}^9\!P_{(t = 4.545)} = 0.000697\)
    • For both the cases, \(P_{(t)}\) is lower than \({\alpha}\)
    • Hence null is rejected, the ‘test is statistically significant’
  3. Confidence Interval Approach
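The first two approaches can be reproduced in R from the fictitious values given in this section:

```r
# #Fictitious values: H0: mu <= 22 (Right Tail), xbar = 23, SE = 0.22
t_stat <- (23 - 22) / 0.22
round(t_stat, 3)
## [1] 4.545
# #Test Statistic Approach: critical values at alpha = 0.05
round(qt(0.05, df = 3, lower.tail = FALSE), 3)
## [1] 2.353
round(qt(0.05, df = 9, lower.tail = FALSE), 3)
## [1] 1.833
# #p-value Approach
pt(t_stat, df = 3, lower.tail = FALSE)   # ~0.00997, as quoted above
pt(t_stat, df = 9, lower.tail = FALSE)   # ~0.0007, as quoted above
```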

If the population standard deviation \((\sigma)\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test with increasing sample size.

2-T Rule of Thumb - Skipped “09:55”

Examples

Example:

  1. Question: If we get a z-value of 3.44 (Right Tail), What is the Probability \(P_{(z)}\)
    • For z = 3.44 & Left Tail, p-value = 0.999709 (by pnorm(z))
    • For z = 3.44 & Right Tail, p-value = \(2.91 \times 10^{-4}\) (by 1 - pnorm(z))
    • For z = 3.44 & Right Tail, p-value = \(2.91 \times 10^{-4}\) (by pnorm(z, lower.tail = FALSE))
  2. Question: If we get a z-value of 4.55 (Right Tail), What is the Probability \(P_{(z)}\)
    • For z = 4.55 & Right Tail, p-value = 0.00000268
  3. Question: If we get a z-value of 1.22 (Right Tail), would we reject the null at \({\alpha} = 0.05\)
    • For z = 1.22 & Right Tail, p-value = 0.11123
    • Because \(P_{(z)}\) is greater than the \({\alpha}\), we fail to reject the null, the ‘test is statistically NOT significant’
  4. Question: If we get a z-value of 1.99 (Right Tail), would we reject the null at \({\alpha} = 0.05\)
    • For z = 1.99 & Right Tail, p-value = 0.023295
    • Because \(P_{(z)}\) is lower than the \({\alpha}\), null is rejected, the ‘test is statistically significant’

Code

# #Get P(z)
z01 <- round(pnorm(3.44), digits = 6)
z02 <- 1 - round(pnorm(3.44), digits = 6)
z03 <- round(pnorm(3.44, lower.tail = FALSE), digits = 6)
z04 <- format(pnorm(4.55, lower.tail = FALSE), digits = 3, scientific = FALSE)
z05 <- format(pnorm(1.22, lower.tail = FALSE), digits = 5)
z06 <- format(pnorm(1.99, lower.tail = FALSE), digits = 5)

Validation


6 Data and Statistics

Definitions and Exercises are from the Book (David R. Anderson 2018)

6.1 Overview

6.2 Introduction

Definition 6.1 Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
Definition 6.2 Elements are the entities on which data are collected. (Generally ROWS)
Definition 6.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)
Definition 6.4 The set of measurements obtained for a particular element is called an observation.

Hence, the total number of data items can be determined by multiplying the number of observations by the number of variables.
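A minimal sketch of this count with a made-up data frame (elements as rows, variables as columns):

```r
# #3 observations x 2 variables = 6 data items (made-up data frame)
dd <- data.frame(element = c("A", "B", "C"),
                 size = c(10, 20, 30))
nrow(dd) * ncol(dd)
## [1] 6
```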

Definition 6.5 Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

6.3 Scales of Measurement

Data collection requires one of the following scales of measurement: nominal, ordinal, interval, or ratio.
Definition 6.6 The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.
Definition 6.7 When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.

For Example, Gender as Male and Female. In cases where the scale of measurement is nominal, a numerical code as well as a nonnumerical label may be used. For example, 1 denotes Male, 2 denotes Female. The scale of measurement is nominal even though the data appear as numerical values. Only Mode can be calculated.

Definition 6.8 The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.

For example, Size as small, medium, large. Along with the labels, similar to nominal data, the data can also be ranked or ordered, which makes the measurement scale ordinal. Ordinal data can also be recorded by a numerical code. Median can be calculated but not the Mean.

Definition 6.9 The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.

Interval data are always numerical. These can be ranked or ordered like ordinal. In addition, the differences between them are meaningful.

Definition 6.10 The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.

Variables such as distance, height, weight, and time use the ratio scale of measurement. This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point. Mean can be calculated.

For example, consider the cost of an automobile. A zero value for the cost would indicate that the automobile has no cost and is free. In addition, if we compare the cost of $30,000 for one automobile to the cost of $15,000 for a second automobile, the ratio property shows that the first automobile is $30,000/$15,000 = 2 times, or twice, the cost of the second automobile.

See Table 6.1 for more details.
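As an aside (this mapping is mine, not from the book), the scales of measurement line up naturally with R types: nominal as factor, ordinal as ordered factor, interval/ratio as numeric; the data values are made up:

```r
# #Nominal: labels only; the Mode is the most frequent level
gender <- factor(c("Male", "Female", "Male"))
names(which.max(table(gender)))
## [1] "Male"
# #Ordinal: order is meaningful, so comparisons work
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
size[3] < size[2]
## [1] TRUE
# #Ratio: full arithmetic, including the Mean
weight <- c(61.5, 70.2, 58.9)
round(mean(weight), 2)
## [1] 63.53
```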

6.3.1 Interval scale vs. Ratio scale

Interval scale is a measure of continuous quantitative data that has an arbitrary 0 reference point. This is contrasted with ratio-scaled data, which have a non-arbitrary 0 reference point. Ex: When we look at “profit”, we see that negative profit does make sense to us. So while the 0 for “profit” is meaningful (just like temperature measurements in Celsius), it is arbitrary as a reference point. Therefore, profit is on an Interval scale of measurement.

In an interval scale, you can take difference of two values. You may not be able to take ratios of two values. Ex: Temperature in Celsius. You can say that if temperatures of two places are 40 °C and 20 °C, then one is hotter than the other (taking difference). But you cannot say that first is twice as hot as the second (not allowed to take ratio).

In a ratio scale, you can take a ratio of two values. Ex: 40 kg is twice as heavy as 20 kg (taking ratios).

Also, “0” on ratio scale means the absence of that physical quantity. “0” on interval scale does not mean the same. 0 kg means the absence of weight. 0 °C does not mean absence of heat.

Table 6.1: (C01V01) Interval scale vs. Ratio scale

  • Variable property: Interval scale allows addition and subtraction. Ratio scale also allows multiplication and division, i.e. you can calculate ratios and thus leverage numbers on the scale against 0.
  • Absolute zero point: The zero point in an interval scale is arbitrary; for example, temperature can be below 0 °C and into negative temperatures. The ratio scale has an absolute zero or character of origin; height and weight cannot be zero or below zero.
  • Calculation: Statistically, in an interval scale, the Arithmetic Mean is calculated; statistical dispersion permits range and standard deviation, but the coefficient of variation is not permitted. In a ratio scale, the Geometric or Harmonic mean is calculated; also, range and coefficient of variation are permitted for measuring statistical dispersion.
  • Measurement: Interval scale can measure size and magnitude as multiple factors of a defined unit. Ratio scale can measure size and magnitude as a factor of one defined unit in terms of another.
  • Example: Interval scale - Temperature in Celsius, calendar years and time, Profit. Ratio scale - quantities possessing an absolute zero characteristic, like age, weight, height, or Sales.

6.4 Categorical and Quantitative Data

Definition 6.11 Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.
Definition 6.12 Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.

If the variable is categorical, the statistical analysis is limited. We can summarize categorical data by counting the number of observations in each category or by computing the proportion of the observations in each category. However, even when the categorical data are identified by a numerical code, arithmetic operations do not provide meaningful results.

Arithmetic operations provide meaningful results for quantitative variables. For example, quantitative data may be added and then divided by the number of observations to compute the average value.

Quantitative data may be discrete or continuous.

Definition 6.13 Quantitative data that measure ‘how many’ are discrete.
Definition 6.14 Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.

6.5 Cross-Sectional and Time Series Data

Definition 6.15 Cross-sectional data are data collected at the same or approximately the same point in time.
Definition 6.16 Time-series data are data collected over several time periods.

6.6 Observational Study and Experiment

Definition 6.17 In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.
Definition 6.18 The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.

As a result, the data obtained from a well-designed experiment can often provide more information as compared to the data obtained from existing sources or by conducting an observational study.

6.7 Caution

  1. Time and Cost - The cost of data acquisition and the subsequent statistical analysis should not exceed the savings generated by using the information to make a better decision.
  2. Data Acquisition Errors - An error in data acquisition occurs whenever the data value obtained is not equal to the true or actual value that would be obtained with a correct procedure. Ex: recording error, misinterpretation etc. Blindly using any data that happen to be available or using data that were acquired with little care can result in misleading information and bad decisions.

6.8 Descriptive Statistics

Definition 6.19 Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.

6.9 Population and Sample

Definition 6.20 A population is the set of all elements of interest in a particular study.
Definition 6.21 A sample is a subset of the population.
Definition 6.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

Refer Sample For More …

6.10 Difference between a population and a sample

The population is the set of entities under study.

  • For example, the mean height of men. (Population “men,” parameter of interest “height”)
    • We choose the population that we wish to study.
    • Typically it is impossible to survey/measure the entire population because not all members are observable.
    • If it is possible to enumerate the entire population it is often costly to do so and would take a great deal of time.

Instead, we could take a subset of this population called a sample and use this sample to draw inferences about the population under study, given some conditions.

  • It is an inference because there will be some uncertainty and inaccuracy involved in drawing conclusions about the population based upon a sample.
    • In Simple Random Sampling (SRS) each member of the population has an equal probability of being included in the sample, hence the term “random.” There are many other sampling methods e.g. stratified sampling, cluster sampling, etc.
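A minimal SRS sketch with sample(); the numbered population of 10,000 members is an assumption for illustration:

```r
# #Simple Random Sampling: each of 10,000 numbered members has an
# #equal chance of selection (population is made up)
set.seed(7)
population <- 1:10000
srs <- sample(population, size = 30)
length(srs)
## [1] 30
```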

6.11 Statistical Inference

Definition 6.23 The process of conducting a survey to collect data for the entire population is called a census.
Definition 6.24 The process of conducting a survey to collect data for a sample is called a sample survey.
Definition 6.25 Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

Whenever statisticians use a sample to estimate a population characteristic of interest, they usually provide a statement of the quality, or precision, associated with the estimate.

Inferential statistics are used for Hypothesis Testing.

  • It is often used to compare the differences between the treatment groups.
  • It uses measurements from the sample of subjects in the experiment to compare the treatment groups and make generalizations about the larger population of subjects.
  • Most inferential statistics are based on the principle of calculating a test-statistic value from a particular formula.
    • That value along with the degrees of freedom, and the rejection criteria are used to determine whether differences exist between the treatment groups.
    • The larger the sample size, the more likely a statistic is to indicate that differences exist between the treatment groups.

The two most common types of Statistical Inference are:

  1. Confidence Intervals
    • To estimate a population parameter
  2. Test of Significance
    • To assess the evidence provided by data about some claim concerning a population
    • i.e. To compare observed data with a claim (Hypothesis)
    • The results of a significance test are expressed in terms of a probability that measures how well the data and the claim agree

Reasoning for Tests of Significance

  • Example: Is the sample mean \({\overline{x}}\) significantly different from the population mean \({\mu}\)?
  • To determine if two numbers are significantly different, a statistical test must be conducted to provide evidence
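A one-sample t-test is one such test. The sketch below uses simulated data and an assumed claim of \({\mu} = 5\); both are made-up for illustration.

```r
# #Sketch: is the sample mean significantly different from mu = 5?
# #Data and the hypothesised mean are made-up for illustration
set.seed(42)
x <- rnorm(30, mean = 5.5, sd = 1)  # a sample of n = 30
tt <- t.test(x, mu = 5)             # H0: population mean equals 5
tt$conf.int                         # 95% confidence interval for the mean
tt$p.value                          # compare against the significance level
```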

6.12 Analytics

Definition 6.26 Analytics is the scientific process of transforming data into insight for making better decisions.

Analytics is used for data-driven or fact-based decision making, which is often seen as more objective than alternative approaches to decision making. The tools of analytics can aid decision making by creating insights from data, improving our ability to more accurately forecast for planning, helping us quantify risk, and yielding better alternatives through analysis.
Analytics is now generally thought to comprise three broad categories of techniques. These categories are descriptive analytics, predictive analytics, and prescriptive analytics.

Definition 6.27 Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.

Examples of these types of techniques are data queries, reports, descriptive statistics, data visualization, data dashboards, and basic what-if spreadsheet models.

Definition 6.28 Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.

Linear regression, time series analysis, and forecasting models fall into the category of predictive analytics. Simulation, which is the use of probability and statistical computer models to better understand risk, also falls under the category of predictive analytics.
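As a minimal sketch of predictive analytics, a linear regression on R's built-in cars data set (speed vs. stopping distance):

```r
# #Predictive analytics sketch: fit a linear model on built-in data
fit <- lm(dist ~ speed, data = cars)
coef(fit)                                        # intercept and slope
predict(fit, newdata = data.frame(speed = 21))   # predict for a new value
```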

Prescriptive analytics differs greatly from descriptive or predictive analytics. What distinguishes prescriptive analytics is that prescriptive models yield a best course of action to take. That is, the output of a prescriptive model is a best decision.

Definition 6.29 Prescriptive analytics is the set of analytical techniques that yield a best course of action.

Optimization models, which generate solutions that maximize or minimize some objective subject to a set of constraints, fall into the category of prescriptive models.
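A one-variable sketch of such a model using base R's optimize(); the cost function here is made-up for illustration.

```r
# #Prescriptive analytics sketch: choose the order size q that minimises
# #a made-up cost function (one term falls with q, the other rises)
cost <- function(q) 100 / q + 2 * q
best <- optimize(cost, interval = c(1, 50))
best$minimum   # best decision; analytically sqrt(50) ~ 7.07
```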

6.13 Big Data and Data Mining

Definition 6.30 Larger and more complex data sets are now often referred to as big data.

Big data is often characterized by three V's: volume refers to the amount of available data; velocity refers to the speed at which data is collected and processed; and variety refers to the different data types. The term data warehousing refers to the process of capturing, storing, and maintaining the data.

Definition 6.31 Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.

Data mining relies heavily on statistical methodology such as multiple regression, logistic regression, and correlation.
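For instance, a logistic regression can be sketched with glm() on R's built-in mtcars data (am: 0 = automatic, 1 = manual transmission):

```r
# #Data-mining sketch: logistic regression of transmission type on weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit)[["wt"]] < 0   # heavier cars are less likely to be manual
## [1] TRUE
```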

6.14 Exercises

  • Table: 6.2
  • Table: 6.3
  • Table: 6.4
    • Who appears to be the market share leader, and how are the market shares changing over time?
      • Caution: Trend Analysis should be done by linear regression with cor(), lm(), etc.

R

Load Data

xxComputers <- f_getObject("xxComputers", "C01-Computers.csv", "971fb6096e4f71e8185d3327a9033a10")
xxCordless <- f_getObject("xxCordless", "C01-Cordless.csv", "9991f612fe44f1c890440bd238084679")

f_getObject()

f_getObject <- function(x_name, x_source, x_md = "") {
  # #Debugging
  a07bug <- FALSE
  # #Read the File or Object
  # #Ex: xxCars <- f_getObject("xxCars", "S16-cars2.csv", "30051fb47f65810f33cb992015b849cc")
  # #tools::md5sum("xx.csv") OR tools::md5sum(paste0(.z$XL, "xx", ".txt"))
  #
  # #Path to the File 
  loc_src <- paste0(.z$XL, x_source)
  # #Path to the Object
  loc_rds <- paste0(.z$XL, x_name, ".rds")
  #
  # #x_file[1] FILENAME & x_file[2] FILETYPE
  x_file <- strsplit(x_source, "[.]")[[1]]
  #
  if(all(x_md == tools::md5sum(loc_src),  file.exists(loc_rds),
        file.info(loc_src)$mtime < file.info(loc_rds)$mtime)) {
      # #Read RDS if (exists, newer than source, source not modified i.e. passes md5sum)
      if(a07bug) print("A07 Flag 01: Reading from RDS")
      return(readRDS(loc_rds))
  } else if(!file.exists(loc_src)){
      stop("ERROR: File does not exist! : ", loc_src)
  } else if(x_file[2] == "csv") {
      # #Read CSV as a Tibble
      # #col_double(), col_character(), col_logical(), col_integer()
      # #DATETIME (EXCEL) "YYYY-MM-DD HH:MM:SS" imported as "UTC"
      tbl <- read_csv(loc_src, show_col_types = FALSE)
      # #Remove Unnecessary Attributes
      attr(tbl, "spec") <- NULL
      attr(tbl, "problems") <- NULL
      # #Write Object as RDS
      saveRDS(tbl, loc_rds)
      # #Return Object
      if(a07bug) print("A07 Flag 02: Reading from Source and Saving as RDS")
      return(tbl)
  } else if(x_file[2] == "xlsx") {
      # #Read All Sheets of Excel in a list
      tbl <- lapply(excel_sheets(loc_src), read_excel, path = loc_src)
      # #Write Object as RDS
      saveRDS(tbl, loc_rds)
      # #Return Object
      return(tbl)
  } else {
      stop("f_getObject(): unknown file type: ", x_file[2])
  }
}

Transpose Tibble

bb <- tibble(Company = c("Hertz", "Dollar", "Avis"), 
              `2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
              `2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
## # A tibble: 12 x 3
##    Company Year  Values
##    <chr>   <chr>  <dbl>
##  1 Hertz   2007     327
##  2 Hertz   2008     311
##  3 Hertz   2009     286
##  4 Hertz   2010     290
##  5 Dollar  2007     167
##  6 Dollar  2008     140
##  7 Dollar  2009     106
##  8 Dollar  2010     108
##  9 Avis    2007     204
## 10 Avis    2008     220
## 11 Avis    2009     300
## 12 Avis    2010     270
# #Transpose
(ii <- bb %>% 
  pivot_longer(!Company, names_to = "Year", values_to = "Values") %>% 
  pivot_wider(names_from = Company, values_from = Values))
## # A tibble: 4 x 4
##   Year  Hertz Dollar  Avis
##   <chr> <dbl>  <dbl> <dbl>
## 1 2007    327    167   204
## 2 2008    311    140   220
## 3 2009    286    106   300
## 4 2010    290    108   270
# #Equivalent
stopifnot(identical(ii, 
                    bb %>% pivot_longer(-1) %>% 
                      pivot_wider(names_from = 1, values_from = value) %>% 
                      rename(., Year = name)))

Computers

Table 6.2: (C01T02) xxComputers
SN tablet cost os display_inch battery_hh cpu
1 Acer Iconia W510 599 Windows 10.1 8.5 Intel
2 Amazon Kindle Fire HD 299 Android 8.9 9.0 TI OMAP
3 Apple iPad 4 499 iOS 9.7 11.0 Apple
4 HP Envy X2 860 Windows 11.6 8.0 Intel
5 Lenovo ThinkPad Tablet 668 Windows 10.1 10.5 Intel
6 Microsoft Surface Pro 899 Windows 10.6 4.0 Intel
7 Motorola Droid XYboard 530 Android 10.1 9.0 TI OMAP
8 Samsung Ativ Smart PC 590 Windows 11.6 7.0 Intel
9 Samsung Galaxy Tab 525 Android 10.1 10.0 Nvidia
10 Sony Tablet S 360 Android 9.4 8.0 Nvidia

Mean

# #What is the average cost for the tablets #$582.90
cat(paste0("Avg. Cost for the tablets is = $", round(mean(bb$cost), digits = 1), "\n"))
## Avg. Cost for the tablets is = $582.9
#
# #Compare the average cost of tablets with different OS (Windows /Android) #$723.20 $428.5
(ii <- bb %>%
  group_by(os) %>%
  summarise(Mean = round(mean(cost), digits = 1)) %>%
  arrange(desc(Mean)) %>% 
    mutate(Mean = paste0("$", Mean)))
## # A tibble: 3 x 2
##   os      Mean  
##   <chr>   <chr> 
## 1 Windows $723.2
## 2 iOS     $499  
## 3 Android $428.5
#
cat(paste0("Avg. Cost of Tablets with Windows OS is = ", 
  ii %>% filter(os == "Windows") %>% select(Mean), "\n"))
## Avg. Cost of Tablets with Windows OS is = $723.2

Percentage

# #What percentage of tablets use an Android operating system #40%
(ii <- bb %>%
  group_by(os) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
  arrange(desc(PCT)) %>% 
  mutate(PCT = paste0(PCT, "%")))
## # A tibble: 3 x 2
##   os      PCT  
##   <chr>   <chr>
## 1 Windows 50%  
## 2 Android 40%  
## 3 iOS     10%
#
cat(paste0("Android OS is used in ", 
  ii %>% filter(os == "Android") %>% select(PCT), " Tablets\n"))
## Android OS is used in 40% Tablets

Cordless

Table 6.3: (C01T03) xxCordless
SN brand model price overall_score voice_quality handset_on_base talk_time_hh
1 AT&T CL84100 60 73 Excellent Yes 7
2 AT&T TL92271 80 70 Very Good No 7
3 Panasonic 4773B 100 78 Very Good Yes 13
4 Panasonic 6592T 70 72 Very Good No 13
5 Uniden D2997 45 70 Very Good No 10
6 Uniden D1788 80 73 Very Good Yes 7
7 Vtech DS6521 60 72 Excellent No 7
8 Vtech CS6649 50 72 Very Good Yes 7

Mean

# #What is the average price for the cordless telephones 
cat(paste0("Avg. Price is = $", round(mean(bb$price), digits = 1), "\n"))
## Avg. Price is = $68.1
#
# #What is the average talk time for the cordless telephones
cat(paste0("Avg. Talk Time is = ", round(mean(bb$talk_time_hh), digits = 1), " Hours \n"))
## Avg. Talk Time is = 8.9 Hours

Percentage

# #What percentage of the cordless telephones have a voice quality of excellent 
(hh <- bb %>%
  group_by(voice_quality) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
    mutate(voice_quality = factor(voice_quality, 
                                  levels = c("Very Good", "Excellent"), ordered = TRUE)) %>% 
    arrange(desc(voice_quality)) %>% 
    mutate(PCT = paste0(PCT, "%")))
## # A tibble: 2 x 2
##   voice_quality PCT  
##   <ord>         <chr>
## 1 Excellent     25%  
## 2 Very Good     75%
#
cat(paste0("Percentage of 'Excellent' Voice Quality is = ", 
  hh %>% filter(voice_quality == "Excellent") %>% select(PCT), "\n"))
## Percentage of 'Excellent' Voice Quality is = 25%
#
# #Equivalent
print(bb %>%
 group_by(voice_quality) %>%
 summarise(PCT = n()) %>%
 mutate(PCT = prop.table(PCT) * 100))
## # A tibble: 2 x 2
##   voice_quality   PCT
##   <chr>         <dbl>
## 1 Excellent        25
## 2 Very Good        75

PCT 2

# #What percentage of the cordless telephones have a handset on the base 
bb %>%
  group_by(handset_on_base) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
  arrange(desc(PCT)) %>% 
  mutate(PCT = paste0(PCT, "%")) %>%
  filter(handset_on_base == "Yes") 
## # A tibble: 1 x 2
##   handset_on_base PCT  
##   <chr>           <chr>
## 1 Yes             50%

Cars

Transform

Table 6.4: (C01T04) Cars in Service
Company 2007 2008 2009 2010
Hertz 327 311 286 290
Dollar 167 140 106 108
Avis 204 220 300 270
Table 6.4: (C01T04B) Cars (Transposed)
Year Hertz Dollar Avis
2007 327 167 204
2008 311 140 220
2009 286 106 300
2010 290 108 270
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"), 
              `2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
              `2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
hh <- bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
# #Transpose
ii <- bb %>% 
  pivot_longer(!Company, names_to = "Year", values_to = "Values") %>% 
  pivot_wider(names_from = Company, values_from = Values)

TimeSeries

# #Save an Image
ggsave(paste0(.z$PX, "C01P01_Cars_TimeSeries", ".png"), plot = C01P01)
# #Load an Image
knitr::include_graphics(paste0(.z$PX, "C01P01_Cars_TimeSeries", ".png"))

Figure 6.1 Multiple Time Series Graph

Rowwise

# #who appears to be the market share leader
# #how the market shares are changing over time
print(ii)
## # A tibble: 4 x 4
##   Year  Hertz Dollar  Avis
##   <chr> <dbl>  <dbl> <dbl>
## 1 2007    327    167   204
## 2 2008    311    140   220
## 3 2009    286    106   300
## 4 2010    290    108   270
# #Row Total
jj <- ii %>% rowwise() %>% mutate(SUM = sum(c_across(where(is.numeric)))) %>% ungroup()
kk <- ii %>% mutate(SUM = rowSums(across(where(is.numeric))))
stopifnot(identical(jj, kk))
#
# #Rowwise Percentage Share 
ii %>% 
  rowwise() %>% 
  mutate(SUM = sum(c_across(where(is.numeric)))) %>% 
  ungroup() %>%
  mutate(across(2:4, ~ round(. * 100 / SUM, digits = 1), .names = "{.col}.{.fn}")) %>%
  mutate(across(ends_with(".1"), ~ paste0(., "%")))
## # A tibble: 4 x 8
##   Year  Hertz Dollar  Avis   SUM Hertz.1 Dollar.1 Avis.1
##   <chr> <dbl>  <dbl> <dbl> <dbl> <chr>   <chr>    <chr> 
## 1 2007    327    167   204   698 46.8%   23.9%    29.2% 
## 2 2008    311    140   220   671 46.3%   20.9%    32.8% 
## 3 2009    286    106   300   692 41.3%   15.3%    43.4% 
## 4 2010    290    108   270   668 43.4%   16.2%    40.4%

Pareto

# #Bar Plot
aa <- bb %>% 
  select(Company, `2010`) %>% 
  rename("Y2010" = `2010`) %>% 
  arrange(desc(.[2])) %>% 
  mutate(cSUM = cumsum(Y2010)) %>%
  mutate(PCT = 100 * Y2010 / sum(Y2010)) %>% 
  mutate(cPCT = 100 * cumsum(Y2010) / sum(Y2010)) %>% 
  mutate(across(Company, factor, levels = unique(Company), ordered = TRUE))
# #
pareto_chr <- setNames(c(aa$Y2010), aa$Company)
stopifnot(identical(pareto_chr, aa %>% pull(Y2010, Company)))
stopifnot(identical(pareto_chr, aa %>% select(1:2) %>% deframe()))
# #Plot Pareto
C01P02 <- pareto.chart(pareto_chr, 
             xlab = "Company", ylab = "Cars",
             # colors of the chart             
             #col=heat.colors(length(pareto_chr)), 
             # ranges of the percentages at the right
             cumperc = seq(0, 100, by = 20),  
             # label y right
             ylab2 = "Cumulative Percentage", 
             # title of the chart
             main = "Pareto Chart"
)

Figure 6.2 Pareto of Cars in 2010

Validation


7 Descriptive Statistics

7.1 Overview

7.2 Summarizing Data for a Categorical Variable

Definition 7.1 A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

The relative frequency of a class equals the fraction or proportion of observations belonging to the class, i.e. it is expressed out of 1, whereas the ‘percent frequency’ is expressed out of 100%.

Rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class.
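The frequency, relative frequency, and percent frequency can be sketched with base R's table() on a small made-up sample:

```r
# #Frequency, relative frequency, and percent frequency of made-up data
x <- c("A", "B", "A", "C", "B", "A")
(freq <- table(x))          # frequency distribution
prop.table(freq)            # relative frequency (out of 1)
100 * prop.table(freq)      # percent frequency (out of 100%)
```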

  • Bar Chart
    • Pareto Chart - ggplot() does not allow easy setup of a dual axis
    • Stacked Bar Chart - do not use it if there are more than 2 categories
  • Pie Chart
    • Only use it if the total is 100% and there are fewer than 5 or 6 categories.

Bar & Pie

Table 7.1: (C02T02) Frequency Distribution
softdrink Frequency cSUM PROP PCT cPCT
Coca-Cola 19 19 38 38% 38%
Pepsi 13 32 26 26% 64%
Diet Coke 8 40 16 16% 80%
Dr. Pepper 5 45 10 10% 90%
Sprite 5 50 10 10% 100%

Figure 7.1 Bar Chart and Pie Chart of Frequency

Data

# #Frequency Distribution
aa <- tibble(softdrink = c("Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"), 
             Frequency = c(19, 8, 5, 13, 5))
#
# #Sort, Cumulative Sum, Percentage, and Cumulative Percentage
bb <- aa %>% 
  arrange(desc(Frequency)) %>% 
  mutate(cSUM = cumsum(Frequency)) %>%
  mutate(PROP = 100 * Frequency / sum(Frequency)) %>% 
  mutate(PCT = paste0(PROP, "%")) %>% 
  mutate(cPCT = paste0(100 * cumsum(Frequency) / sum(Frequency), "%"))

Bar

# #Sorted Bar Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P01 <- bb %>% mutate(across(softdrink, factor, levels = rev(unique(softdrink)))) %>% {
  ggplot(data = ., aes(x = softdrink, y = Frequency)) +
  geom_bar(stat = 'identity', aes(fill = softdrink)) + 
  scale_y_continuous(sec.axis = sec_axis(~ (. / sum(bb$Frequency))*100, name = "Percentages", 
                       labels = function(b) { paste0(round(b, 0), "%")})) +
  geom_text(aes(label = paste0(Frequency, "\n(", PCT, ")")), vjust = 2, 
            colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
  k_gglayer_bar +   
  labs(x = "Soft Drinks", y = "Frequency", subtitle = NULL, 
         caption = "C02P01", title = "Bar Chart of Categorical Data")
}

Pie

# #Pie Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P02 <- bb %>% mutate(across(softdrink, factor, levels = unique(softdrink))) %>% {
  ggplot(data = ., aes(x = '', y = Frequency, fill = rev(softdrink))) +
  geom_bar(stat = 'identity', width = 1, color = "white") +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = paste0(softdrink, "\n", Frequency, " (", PCT, ")")), 
            position = position_stack(vjust = 0.5), 
            colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
  k_gglayer_pie +   
  labs(caption = "C02P02", title = "Pie Chart of Categorical Data")
}

f_theme_gg()

f_theme_gg <- function(base_size = 14) {
# #Create a Default Theme 
  theme_bw(base_size = base_size) %+replace%
    theme(
      # #The whole figure
      plot.title = element_text(size = rel(1), face = "bold", 
                                margin = margin(0,0,5,0), hjust = 0),
      # #Area where the graph is located
      panel.grid.minor = element_blank(),
      panel.border = element_blank(),
      # #The axes
      axis.title = element_text(size = rel(0.85), face = "bold"),
      axis.text = element_text(size = rel(0.70), face = "bold"),
#      arrow = arrow(length = unit(0.3, "lines"), type = "closed"),
      axis.line = element_line(color = "black"),
      # The legend
      legend.title = element_text(size = rel(0.85), face = "bold"),
      legend.text = element_text(size = rel(0.70), face = "bold"),
      legend.key = element_rect(fill = "transparent", colour = NA),
      legend.key.size = unit(1.5, "lines"),
      legend.background = element_rect(fill = "transparent", colour = NA),
      # Labels in the case of facetting
      strip.background = element_rect(fill = "#17252D", color = "#17252D"),
      strip.text = element_text(size = rel(0.85), face = "bold", color = "white", margin = margin(5,0,5,0))
    )
}
# #Change default ggplot2 theme 
theme_set(f_theme_gg()) 
#
# #List of Specific sets. Note '+' is replaced by ','
k_gglayer_bar <- list(
  scale_fill_viridis_d(),
  theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
        panel.border = element_rect(colour = "black", fill=NA, size=1),
        legend.position = 'none', axis.title.y.right = element_blank())
)
#
# #Pie
k_gglayer_pie <- list(
  scale_fill_viridis_d(),
  #theme_void(),
  theme(#panel.background = element_rect(fill = "white", colour = "white"),
        #plot.background = element_rect(fill = "white",colour = "white"),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        #panel.border = element_rect(colour = "black", fill=NA, size=1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = 'none')
)
#
# #Histogram
k_gglayer_hist <- list(
  scale_fill_viridis_c(direction = -1, alpha = 0.9),
  theme(panel.grid.major.x = element_blank(), axis.line.y = element_blank(),
        panel.border = element_blank(), axis.ticks.y = element_blank(), 
        legend.position = 'none')
)
#
# #Scatter Plot Trendline
k_gglayer_scatter <- list(
  scale_fill_viridis_d(alpha = 0.9),
  theme(panel.grid.minor = element_blank(),
        panel.border = element_blank())
)
#
# #BoxPlot
k_gglayer_box <- list(
  scale_fill_viridis_d(alpha = 0.9),
  theme(panel.grid.major = element_line(colour = "#d3d3d3"),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), panel.grid.major.x = element_blank(),
        #plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        #text=element_text(family = "Tahoma"),
        #axis.title = element_text(face="bold"),
        #axis.text.x = element_text(colour="black", size = 11),
        #axis.text.y = element_text(colour="black", size = 9),
        axis.line = element_line(size=0.5, colour = "black"))
)

Errors

ERROR 7.1 Error: stat_count() can only have an x or y aesthetic.

Solution: Use geom_bar(stat = "identity")
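Equivalently, geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"); reusing the frequency tibble bb from above:

```r
# #geom_col() is shorthand for geom_bar(stat = "identity")
library(ggplot2)
ggplot(bb, aes(x = softdrink, y = Frequency)) + geom_col()
```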

7.3 Summarizing Data for a Quantitative Variable

A histogram is used for continuous data, where the bins represent ranges of data, while a bar chart is a plot of categorical variables.

The three steps necessary to define the classes for a frequency distribution with quantitative data are

  1. Determine the number of nonoverlapping classes (Bins)
    • Classes are formed by specifying ranges that will be used to group the data.
    • Approx. 5-20
  2. Determine the width of each class
    • The bins are usually specified as consecutive, non-overlapping intervals of a variable.
    • The bins (intervals) must be adjacent and are often (but not required to be) of equal size.
    • Approx. Bin Width = (Max - Min) / Number of Bins
    • Ex: For a dataset with min = 12 & max = 33, 5 bins of 10-14, …, 30-34 can be selected
  3. Determine the class
    • Class limits must be chosen so that each data item belongs to one and only one class
    • For categorical data, this was not required because each item naturally fell into a separate class
    • But with quantitative data, class limits are necessary to determine where each data value belongs
    • The ‘class midpoint’ is the value halfway between the lower and upper class limits. For a Bin of 10-14, 12 will be its mid-point.
  • Dot Plot
    • A horizontal axis shows the range for the data. Each data value is represented by a dot placed above the axis.
    • Caution: Avoid! Y-Axis is deceptive.
  • Histogram
    • Unlike a bar graph, a histogram contains no natural separation between the rectangles of adjacent classes.
  • Stem-and-Leaf Display (Not useful)
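The three steps above can be sketched with cut(), reusing the min = 12, max = 33 example with 5 classes of width 5; the data values themselves are made-up.

```r
# #Grouping made-up quantitative data (min = 12, max = 33) into the
# #5 non-overlapping classes 10-14, 15-19, ..., 30-34
x <- c(12, 14, 19, 18, 15, 15, 18, 17, 20, 27, 22, 23, 22, 21, 33, 28)
bins <- cut(x, breaks = seq(9.5, 34.5, by = 5),
            labels = c("10-14", "15-19", "20-24", "25-29", "30-34"))
table(bins)   # every data item belongs to one and only one class
```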

Histogram

set.seed(3)
# #Get Normal Data
bb <- tibble(aa = rnorm(n = 10000)) 
# #Histogram
# # '..count..' or '..x..'
C02P03 <- bb %>% {
  ggplot(data = ., aes(x = aa, fill = ..count..)) + 
  geom_histogram(bins = 50, position = "identity") +    
  k_gglayer_hist +
  labs(x = "Normal Data", y = "Count", subtitle = paste0("n = ", format(nrow(.), big.mark=",")), 
       caption = "C02P03", title = "Histogram")
}

Figure 7.2 geom_histogram(): Histogram

Dot Plot

# #Random Data
aa <- c(26, 35, 22, 47, 37, 5, 50, 49, 42, 2, 8, 7, 4, 47, 44, 35, 17, 49, 1, 48, 
        1, 27, 13, 26, 18, 44, 31, 4, 23, 47, 38, 28, 28, 5, 35, 39, 29, 13, 17, 
        38, 1, 8, 3, 30, 18, 37, 29, 39, 7, 28)
bb <- tibble(aa)
# #Dot Chart of Frequencies
C02P04 <- bb %>% {
  ggplot(., aes(x = aa)) +
  geom_dotplot(binwidth = 5, method = "histodot") + 
  theme(axis.line.y = element_blank(), panel.grid = element_blank(), axis.text.y = element_blank(),
        axis.ticks.y = element_blank(), axis.title.y =  element_blank()) + 
  labs(x = "Bins", subtitle = "Caution: Avoid! Y-Axis is deceptive.", 
       caption = "C02P04", title = "Dot Plot")
}

Figure 7.3 geom_dotplot(): Frequency Dot Chart

Get Frequency

7.4 Summarizing Data for Two Variables Using Tables

Definition 7.2 A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, at least one of the variables is categorical.
  • Simpson Paradox
    • The reversal of conclusions based on aggregated and unaggregated data is called Simpson paradox.
    • Ex: Table 7.2 shows the count of judgements that were ‘upheld’ or ‘reversed’ on appeal for two judges
      • 86% of the verdicts were upheld for Judge Abel, while 88% of the verdicts were upheld for Judge Ken. From this aggregated crosstabulation, we conclude that Judge Ken is doing the better job because a greater percentage of his verdicts are being upheld.
      • However, the unaggregated crosstabulations show that in both types of courts (Common, Municipal) Judge Abel has a higher percentage of ‘Upheld’ verdicts (90.6% and 84.7%), compared to Judge Ken (90% and 80%)
      • Thus, Abel has a better record because a greater percentage of his verdicts are being upheld in both courts.
      • This reversal of conclusions based on aggregated and unaggregated data illustrates Simpson paradox.
    • Cause
      • Note that for both judges the percentage of appeals that resulted in reversals was much higher in ‘Municipal’ than in ‘Common’ Court i.e. 15.3% vs. 9.4% for Abel and 20% vs. 10% for Ken.
      • Because Judge Abel tried a much higher percentage of his cases in ‘Municipal,’ the aggregated data favoured Judge Ken i.e. 118/150 for Abel vs. 25/125 for Ken.
      • Thus, for the original crosstabulation, we see that the ‘type of court’ is a hidden variable that cannot be ignored when evaluating the records of the two judges.
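The reversal can be verified numerically from the counts in Table 7.2 with base R:

```r
# #Verify Simpson paradox from the Table 7.2 counts
abel <- matrix(c(29, 3, 100, 18), nrow = 2, byrow = TRUE,
               dimnames = list(c("Common", "Municipal"), c("Upheld", "Reversed")))
ken  <- matrix(c(90, 10, 20, 5), nrow = 2, byrow = TRUE,
               dimnames = list(c("Common", "Municipal"), c("Upheld", "Reversed")))
# #Within each court, Abel has the higher 'Upheld' percentage ...
round(100 * prop.table(abel, margin = 1), 1)   # 90.6% and 84.7%
round(100 * prop.table(ken,  margin = 1), 1)   # 90% and 80%
# #... yet on the aggregate Ken looks better
round(100 * prop.table(colSums(abel)), 1)      # 86% Upheld
round(100 * prop.table(colSums(ken)),  1)      # 88% Upheld
```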
Table 7.2: (C02T01) Both Judges
Judge_Verdict xUpheld xReversed SUM
Abel 129 (86%) 21 (14%) 150
Ken 110 (88%) 15 (12%) 125
Total 239 (86.9%) 36 (13.1%) 275
Table 7.2: (C02T01A) Abel
Abel xUpheld xReversed SUM
Common 29 (90.6%) 3 (9.4%) 32
Municipal 100 (84.7%) 18 (15.3%) 118
Total 129 (86%) 21 (14%) 150
Table 7.2: (C02T01B) Ken
Ken xUpheld xReversed SUM
Common 90 (90%) 10 (10%) 100
Municipal 20 (80%) 5 (20%) 25
Total 110 (88%) 15 (12%) 125

Judges

# #Judges: Because we are evaluating 'Judges', they are the 'elements' and thus are the 'rows'
xxJudges <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
# #Unaggregated crosstab for both Judges in different types of Courts
xxKen <- tibble(Ken = c("Common", "Municipal"), 
                    Upheld = c(90, 20), Reversed = c(10, 5))
xxAbel <- tibble(Abel = c("Common", "Municipal"), 
                    Upheld = c(29, 100), Reversed = c(3, 18))

Transpose

# #Judges
aa <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
bb <- tibble(Verdict_Judge = c('Upheld', 'Reversed'), Abel = c(129, 21), Ken = c(110, 15))
aa
## # A tibble: 2 x 3
##   Judge_Verdict Upheld Reversed
##   <chr>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
# #Transpose, Assuming First Column Header has "Row_Col" Type Format
ii <- aa %>% 
  `attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>% 
  `attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>% 
  pivot_longer(c(-1), 
               names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost), 
               values_to = "Values") %>% 
  pivot_wider(names_from = 1, values_from = Values) %>% 
  `attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL) 
stopifnot(identical(bb, ii))
ii
## # A tibble: 2 x 3
##   Verdict_Judge  Abel   Ken
##   <chr>         <dbl> <dbl>
## 1 Upheld          129   110
## 2 Reversed         21    15
# #Testing for Reverse
ii <- bb %>% 
  `attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>% 
  `attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>% 
  pivot_longer(c(-1), 
               names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost), 
               values_to = "Values") %>% 
  pivot_wider(names_from = 1, values_from = Values) %>% 
  `attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL) 
stopifnot(identical(aa, ii))

String Split

bb <- "Judge_Verdict"
# #Split String by strsplit(), output is list
(ii <- unlist(strsplit(bb, "_")))
## [1] "Judge"   "Verdict"
#
# #Split on Dot 
bb <- "Judge.Verdict"
# #Using character classes
ii <- unlist(strsplit(bb, "[.]"))
# #By escaping special characters
jj <- unlist(strsplit(bb, "\\."))
# #Using Options
kk <- unlist(strsplit(bb, ".", fixed = TRUE))
stopifnot(all(identical(ii, jj), identical(ii, kk)))

Attributes

jj <- ii <- bb <- aa
# #attr() adds or removes an attribute
attr(bb, "NewOne") <- "abc"
# #Using Backticks
ii <- `attr<-`(ii, "NewOne", "abc")
# #Using Pipe
jj <- jj %>% `attr<-`("NewOne", "abc")
#
stopifnot(all(identical(bb, ii), identical(bb, jj)))
#
# #List Attributes
names(attributes(bb))
## [1] "class"     "row.names" "names"     "NewOne"
#
# #Specific Attribute Value
attributes(bb)$NewOne
## [1] "abc"
#
# #Remove Attributes
attr(bb, "NewOne") <- NULL
ii <- `attr<-`(ii, "NewOne", NULL)
jj <- jj %>% `attr<-`("NewOne", NULL)
stopifnot(all(identical(bb, ii), identical(bb, jj)))

Total Row

# #(Deprecated) Issues: 
# #(1) bind_rows() needs two data frames. Thus, the first can be skipped in the Pipe, but...
# #the second data frame cannot be replaced with dot (.); it has to have a name
# #(2) Pipe usage inside a function call was working but was a concern
# #(3) It introduced NA, for which replace() was needed as another step
ii <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
    mutate(across(1, ~ replace(., . %in% NA, "Total"))) 
#
# #(Deprecated) Works but needs ALL Column Names individually
jj <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.[, 2]), Reversed = sum(.[, 3]))
kk <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.$Upheld), Reversed = sum(.$Reversed))
#
# #(Deprecated) Removed the multiple calls to sum(). However, it needs the First Column Header Name
ll <- aa %>% add_row(Judge_Verdict = "Total", summarise(., across(where(is.numeric), sum)))
# #(Deprecated) Replaced Column Header Name using "Tilde"
mm <- aa %>% add_row(summarise(., across(where(is.character), ~"Total")), 
               summarise(., across(where(is.numeric), sum)))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #(Working): Minimised
aa %>% add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum)))
## # A tibble: 3 x 3
##   Judge_Verdict Upheld Reversed
##   <chr>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
## 3 Total            239       36

Replace NA

# #Use '%in%' to match NA, since '==' returns NA; for non-NA values '==' works
aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
    mutate(across(1, ~ replace(., . %in% NA, "Total"))) 
## # A tibble: 3 x 3
##   Judge_Verdict Upheld Reversed
##   <chr>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
## 3 Total            239       36
#
#   #Replace NA in a Factor
aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>% 
  mutate(Judge_Verdict = factor(Judge_Verdict)) %>% 
  mutate(across(1, fct_explicit_na, na_level = "Total"))
## # A tibble: 3 x 3
##   Judge_Verdict Upheld Reversed
##   <fct>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
## 3 Total            239       36

To Factor

#   #Convert to Factor
aa %>% mutate(Judge_Verdict = factor(Judge_Verdict))
## # A tibble: 2 x 3
##   Judge_Verdict Upheld Reversed
##   <fct>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15

Clipboard

# #Paste but do not execute
aa <- read_delim(clipboard())
# #Copy Excel Data, then execute the above command
#
# #Print its structure
dput(aa)
# #Copy the relevant values, headers in tibble()
bb <- tibble(  )
# #The above command will be the setup to generate this tibble anywhere

7.5 Exercise

C02E27

Data

ex27 <- tibble(Observation = 1:30, 
             x = c("A", "B", "B", "C", "B", "C", "B", "C", "A", "B", "A", "B", "C", "C", "C", 
                   "B", "C", "B", "C", "B", "C", "B", "C", "A", "B", "C", "C", "A", "B", "B"), 
             y = c(1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 
                   2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2))

CrossTab

bb <- ex27
str(bb)
## tibble [30 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Observation: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
##  $ x          : chr [1:30] "A" "B" "B" "C" ...
##  $ y          : num [1:30] 1 1 1 2 1 2 1 2 1 1 ...
# #Create CrossTab
bb <- bb %>% 
  count(x, y) %>% 
  pivot_wider(names_from = y, values_from = n, values_fill = 0)

PCT

bb
## # A tibble: 3 x 3
##   x       `1`   `2`
##   <chr> <int> <int>
## 1 A         5     0
## 2 B        11     2
## 3 C         2    10
# #Rowwise Percentage in Separate New Columns
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1), .names = "{.col}_Row" )) 
## # A tibble: 3 x 7
##   x       `1`   `2`   SUM `1_Row` `2_Row` SUM_Row
##   <chr> <int> <int> <dbl>   <dbl>   <dbl>   <dbl>
## 1 A         5     0     5   100       0       100
## 2 B        11     2    13    84.6    15.4     100
## 3 C         2    10    12    16.7    83.3     100
#
# #Rowwise Percentage in Same Columns
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1))) 
## # A tibble: 3 x 4
##   x       `1`   `2`   SUM
##   <chr> <dbl> <dbl> <dbl>
## 1 A     100     0     100
## 2 B      84.6  15.4   100
## 3 C      16.7  83.3   100
#
# #Equivalent
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric))),
         across(where(is.numeric), ~ round(. * 100 /SUM, 1))) 
## # A tibble: 3 x 4
##   x       `1`   `2`   SUM
##   <chr> <dbl> <dbl> <dbl>
## 1 A     100     0     100
## 2 B      84.6  15.4   100
## 3 C      16.7  83.3   100
#
# #Columnwise Percentage in Separate New Columns
bb %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1), .names = "{.col}_Col" ))
## # A tibble: 3 x 5
##   x       `1`   `2` `1_Col` `2_Col`
##   <chr> <int> <int>   <dbl>   <dbl>
## 1 A         5     0    27.8     0  
## 2 B        11     2    61.1    16.7
## 3 C         2    10    11.1    83.3
# #Columnwise Percentage in Same Columns
bb %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1)))
## # A tibble: 3 x 3
##   x       `1`   `2`
##   <chr> <dbl> <dbl>
## 1 A      27.8   0  
## 2 B      61.1  16.7
## 3 C      11.1  83.3

C02E28

Data

ex28 <- tibble(Observation = 1:20, 
        x = c(28, 17, 52, 79, 37, 71, 37, 27, 64, 53, 13, 84, 59, 17, 70, 47, 35, 62, 30, 43), 
        y = c(72, 99, 58, 34, 60, 22, 77, 85, 45, 47, 98, 21, 32, 81, 34, 64, 68, 67, 39, 28))

CrossTab

bb <- ex28
# #Round down to the nearest 10 at or below min() and up to the nearest 10 at or above max()
nn <- 10L   
n_x <- seq(floor(min(bb$x) / nn) * nn, ceiling(max(bb$x) / nn) * nn, by = 20)
n_y <- seq(floor(min(bb$y) / nn) * nn, ceiling(max(bb$y) / nn) * nn, by = 20)
#
# #Labels in the format "10-29" (bin width 20)
lab_x <- paste0(n_x, "-", n_x +20 -1) %>% head(-1)
lab_y <- paste0(n_y, "-", n_y +20 -1) %>% head(-1)

# #Wider Table without Totals
ii <- bb %>% 
  mutate(x_bins = cut(x, breaks = n_x, right = FALSE, labels = lab_x),
         y_bins = cut(y, breaks = n_y, right = FALSE, labels = lab_y)) %>% 
  count(x_bins, y_bins) %>% 
  pivot_wider(names_from = y_bins, values_from = n, values_fill = 0, names_sort = TRUE)
print(ii)
## # A tibble: 4 x 5
##   x_bins `20-39` `40-59` `60-79` `80-99`
##   <fct>    <int>   <int>   <int>   <int>
## 1 10-29        0       0       1       4
## 2 30-49        2       0       4       0
## 3 50-69        1       3       1       0
## 4 70-89        4       0       0       0
# #Cross Tab with Total Column and Total Row
jj <- ii %>% 
  bind_rows(ii %>% summarise(across(where(is.numeric), sum))) %>% 
    mutate(across(1, fct_explicit_na, na_level = "Total")) %>% 
    mutate(SUM = rowSums(across(where(is.numeric))))
print(jj)
## # A tibble: 5 x 6
##   x_bins `20-39` `40-59` `60-79` `80-99`   SUM
##   <fct>    <int>   <int>   <int>   <int> <dbl>
## 1 10-29        0       0       1       4     5
## 2 30-49        2       0       4       0     6
## 3 50-69        1       3       1       0     5
## 4 70-89        4       0       0       0     4
## 5 Total        7       3       6       4    20

Cut

# #Group Continuous Data to Categorical Bins by base::cut()
bb <- ex28
#
# #NOTE cut() increases the range slightly but ggplot functions do not
bb %>% mutate(x_bins = cut(x, breaks = 8)) %>% 
  pull(x_bins) %>% levels()
## [1] "(12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
# 
# #By default, cut() excludes the lowest boundary value; include.lowest = TRUE includes it
bb %>% mutate(x_bins = cut(x, breaks = 8, include.lowest = TRUE)) %>% 
  pull(x_bins) %>% levels()
## [1] "[12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #ggplot2::cut_interval() makes n groups with equal range. There is a cut_number() also
bb %>% mutate(x_bins = cut_interval(x, n = 8)) %>% 
  pull(x_bins) %>% levels()
## [1] "[13,21.9]"   "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84]"
#
# #Specific Bins
bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10))) %>% 
  pull(x_bins) %>% levels()
## [1] "(10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
ii <- bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10), include.lowest = TRUE)) %>% 
  pull(x_bins) %>% levels()
print(ii)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
#
# #ggplot2::cut_width() makes groups of the given width
bb %>% mutate(x_bins = cut_width(x, width = 10)) %>% 
  pull(x_bins) %>% levels()
## [1] "[5,15]"  "(15,25]" "(25,35]" "(35,45]" "(45,55]" "(55,65]" "(65,75]" "(75,85]"
#
# #Match cut_width() and cut()
jj <- bb %>% mutate(x_bins = cut_width(x, width = 10, boundary = 0)) %>% 
  pull(x_bins) %>% levels()
print(jj)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
stopifnot(identical(ii, jj))
#
# #Labelling
n_breaks <- seq(10, 90, by = 10)
n_labs <- paste0("*", n_breaks, "-", n_breaks + 10) %>% head(-1)

bb %>% mutate(x_bins = cut(x, breaks = n_breaks, include.lowest = TRUE, labels = n_labs)) %>% 
  pull(x_bins) %>% levels()
## [1] "*10-20" "*20-30" "*30-40" "*40-50" "*50-60" "*60-70" "*70-80" "*80-90"

7.6 Summarizing Data for Two Variables

  • Scatterplot and Trendline
  • Side by Side and Stacked Bar Charts

Data

xxCommercials <- tibble(Week = 1:10, 
                 Commercials = c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2), 
                 Sales = c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46))
f_setRDS(xxCommercials)
geom_point(), geom_smooth(), & stat_poly_eq()

Figure 7.4 geom_point(), geom_smooth(), & stat_poly_eq()

Trendline

bb <- xxCommercials 

# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot and Trendline along with its equation and R-squared value
C02P05 <- bb %>% {
  ggplot(data = ., aes(x = Commercials, y = Sales)) + 
  geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
  stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")), 
               formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = "~italic(x)", parse = TRUE) +
  geom_point() + 
  labs(x = "Commercials", y = "Sales ($100s)", 
       subtitle = paste0("Trendline equation and R", '\u00b2', " value"), 
       caption = "C02P05", title = "Scatter Plot")
}

Validation


8 Numerical Measures

8.1 Overview

8.2 Definitions (Ref)

6.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

8.3 Number Theory

Definition 8.1 A number is a mathematical object used to count, measure, and label. Their study or usage is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.

Individual numbers can be represented by symbols, called numerals; for example, “5” is a numeral that represents the ‘number five.’

As only a relatively small number of symbols can be memorized, basic numerals are commonly organized in a numeral system, which is an organized way to represent any number. The most common numeral system is the Hindu–Arabic numeral system, which allows for the representation of any number using a combination of ten fundamental numeric symbols, called digits.

Counting is the process of determining the number of elements of a finite set of objects, i.e., determining the size of a set. Enumeration refers to uniquely identifying the elements of a set by assigning a number to each element.

Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events.

Sets

Formally, \(\mathbb{N} \to \mathbb{Z} \to \mathbb{Q} \to \mathbb{R} \to \mathbb{C}\)
Practically, \(\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}\)

The natural numbers \(\mathbb{N}\) are those numbers used for counting and ordering. The ISO standard begins the natural numbers with 0, corresponding to the non-negative integers \(\mathbb{N} = \{0, 1, 2, 3, \ldots \}\), whereas others start with 1, corresponding to the positive integers \(\mathbb{N^*} = \{1, 2, 3, \ldots \}\)

The set of integers \(\mathbb{Z}\) consists of zero (\({0}\)), the positive natural numbers \(\{1, 2, 3, \ldots \}\) and their additive inverses (the negative integers). That is, \(\mathbb{Z} = \{\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \}\). An integer is colloquially defined as a number that can be written without a fractional component.

Rational numbers \(\mathbb{Q}\) are those which can be expressed as the quotient or fraction p/q of two integers, a numerator p and a non-zero denominator q. Thus, Rational Numbers \(\mathbb{Q} = \left\{ \frac{p}{q} : p, q \in \mathbb{Z},\ q \neq 0 \right\}\)

A real number is a value of a continuous quantity that can represent a distance along a line. The real numbers include all the rational numbers \(\mathbb{Q}\), and all the irrational numbers. Thus, Real Numbers \(\mathbb{R} = \mathbb{Q} \cup \{\sqrt{2}, \sqrt{3}, \ldots\} \cup \{ \pi, e, \phi, \ldots \}\)

The complex numbers \(\mathbb{C}\) are numbers expressed in the form \(a + ib\), where \({a}\) and \({b}\) are real numbers. Each thus has two real components, combined with a specific element denoted by \({i}\) (the imaginary unit) which satisfies the equation \(i^2 = -1\).
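
These sets map loosely onto base R's atomic types; a small base-R sketch (the values are chosen purely for illustration):

```r
# #Integer and double are distinct atomic types in R
is.integer(5L)     # TRUE
is.double(5)       # TRUE (unsuffixed numbers are doubles)
# #The imaginary unit satisfies i^2 = -1 exactly under multiplication
(0+1i) * (0+1i)    # -1+0i
# #sqrt() of a negative double is NaN, but the complex version works
sqrt(-1+0i)        # ~0+1i, up to floating point error in the real part
```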

Pi

The number Pi \(\pi = 3.14159\ldots\) is defined as the ratio of the circumference of a circle to its diameter.

\[\pi = \int _{-1}^{1} \frac{dx}{\sqrt{1- x^2}} \tag{8.1}\]

\[e^{i\varphi}=\cos \varphi +i\sin \varphi \tag{8.2}\]

\[e^{i\pi} + 1 =0 \tag{8.3}\]
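
Equations (8.1) and (8.3) can be checked numerically in base R with integrate(); the tolerances below are illustrative:

```r
# #Approximate Pi by integrating equation (8.1) numerically
ii <- integrate(function(x) 1 / sqrt(1 - x^2), lower = -1, upper = 1)
ii$value
## [1] 3.141593
stopifnot(abs(ii$value - pi) < 1e-6)
#
# #Euler's identity (8.3): exp(i*Pi) + 1 is zero up to floating point error
stopifnot(Mod(exp(1i * pi) + 1) < 1e-12)
```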

# #Read OIS File for 20000 PI digits including integral (3) and fractional (14159...)
# #md5sum = "daf0b33a67fd842a905bb577957a9c7f"
tbl <- read_delim(file = paste0(.z$XL, "PI-OIS-b000796.txt"), 
  delim = " ", col_names = c("POS", "VAL"), col_types = list(POS = "i", VAL = "i"))
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxPI <- tbl
f_setRDS(xxPI)

e

Euler's number \(e = 2.71828\ldots\) is the base of the natural logarithm.

\[e = \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^{n} = \sum \limits_{n=0}^{\infty} \frac{1}{n!} \tag{8.4}\]
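
Both forms of equation (8.4) are easy to check in base R; the choice of n and the series cut-off below are arbitrary:

```r
# #Limit form of (8.4): (1 + 1/n)^n approaches e as n grows
n <- 1e6
(1 + 1/n)^n              # ~2.71828
# #Series form: the sum of 1/k! converges to e very quickly
sum(1 / factorial(0:17))
exp(1)
stopifnot(abs((1 + 1/n)^n - exp(1)) < 1e-4,
          abs(sum(1 / factorial(0:17)) - exp(1)) < 1e-12)
```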

Phi

Two quantities are in the golden ratio \(\varphi = 1.618\ldots\) if their ratio is the same as the ratio of their sum to the larger of the two quantities.

\[\varphi^2 - \varphi -1 =0 \\ \varphi = \frac{1+\sqrt{5}}{2} \tag{8.5}\]
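
Equation (8.5) can be verified directly:

```r
# #Golden ratio from equation (8.5)
phi <- (1 + sqrt(5)) / 2
print(phi)
## [1] 1.618034
# #It satisfies phi^2 - phi - 1 = 0, up to floating point error
stopifnot(abs(phi^2 - phi - 1) < 1e-12)
```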

Groups

Definition 8.2 A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither prime nor composite; it is a ‘Unit.’ Thus, by definition, negative integers and zero cannot be prime.
Definition 8.3 Parity is the property of an integer \(\mathbb{Z}\) of whether it is even or odd. It is even if the integer is divisible by 2 with no remainder, and odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.
Definition 8.4 An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.
Definition 8.5 Mersenne primes are those prime numbers that are of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)

Mersenne primes:

  • \(\{3, 7, 31, 127, 8191, 131071, 524287, 2147483647, 2305843009213693951, \ldots \}\)
  • \(\{3 (2^{nd}), 7(4^{th}), 31(11^{th}), 127(31^{st}), 8191 (1028^{th}), 131071 (12,251^{st}), 524287 (43,390^{th}), \ldots \}\)
    • Mersenne primes with their position in List of Primes
  • \(2147483647 = (2^{31} − 1)\)
    • It is 105,097,565\(^{th}\) Prime, \(8^{th}\) Mersenne prime and is one of only four known double Mersenne primes.
    • It represents the largest value that a signed 32-bit integer field can hold.
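
A small base-R check of the parity rule (Definition 8.3) and the Mersenne form (Definition 8.5); the exponents below are the first few that happen to yield Mersenne primes:

```r
# #Parity: %% with a positive divisor returns a non-negative remainder,
# #so the even test works for negative integers too
xx <- c(-2L, -1L, 0L, 1L, 2L)
xx %% 2L == 0L
## [1]  TRUE FALSE  TRUE FALSE  TRUE
#
# #Mersenne numbers 2^n - 1 for exponents n = 2, 3, 5, 7
2^c(2, 3, 5, 7) - 1
## [1]   3   7  31 127
#
# #The largest signed 32-bit integer is itself a Mersenne prime: 2^31 - 1
stopifnot(.Machine$integer.max == 2^31 - 1)
```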

8.4 Primes

Empty Vector

# #Create empty Vector with NA
aa <- rep(NA_integer_, 10)
print(aa)
##  [1] NA NA NA NA NA NA NA NA NA NA

f_isPrime()

f_isPrime <- function(x) {
  # #Check if the number is Prime
  if(!is.integer(x)) {
    cat("Error! Integer required. \n")
    stop()
  } else if(x <= 0L) {
    cat("Error! Positive Integer required. \n")
    stop()
  } else if(x > 2147483647L) {
    cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
    stop()
  }
  # #NOTE: this checks x against ALL integers up to ceiling(sqrt(x)), including non-primes
  if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
    # # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
    return(TRUE)
  } else {
    ## (any(x %% 2L:ceiling(sqrt(x)) == 0))
    ## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
    ## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
    return(FALSE)
  }
}
# #Vectorise Version
f_isPrimeV <- Vectorize(f_isPrime)
# #Compiled Version
f_isPrimeC <- cmpfun(f_isPrime)

Primes

# #There are 4 Primes in First 10, 25 in 100, 168 in 1000, 1229 in 10000.
# # Using Vectorise Version, get all the Primes
aa <- 1:10
bb <- aa[f_isPrimeV(aa)]
ii <- f_getPrimeUpto(10)
stopifnot(identical(bb, ii))
# #
xxPrime10 <- c(2, 3, 5, 7) |> as.integer()
# #
xxPrime100 <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 
               53, 59, 61, 67, 71, 73, 79, 83, 89, 97)  |> as.integer()
#
# #Generate List of ALL Primes till 524287 (i.e. Total 43,390 Primes)
xxPrimes <- f_getPrimeUpto(524287L)
# #Save as RDS
f_setRDS(xxPrimes)

Large Integers

# #NOTE: Assigning 2147483647L causes the Chunk to throw Warnings even with 'eval=FALSE'.
if(FALSE){
# #Assignment of 2305843009213693951L is NOT possible without Warning
# #Even within non-executing Block or with 'eval=FALSE' or suppressWarnings() or tryCatch()
# #It cannot be stored as integer, thus it is automatically converted to double
  #bb <- 2305843009213693951L
# #Warning: non-integer value 2305843009213693951L qualified with L; using numeric value 
# #NOTE that the value changed. It is explicitly NOT a prime anymore.
  #print(bb, digits = 20)
# #[1] 2305843009213693952
#
# #Assignment of 2147483647L is possible and direct printing in console works BUT
# #Its printing will also throw Warnings that are difficult to handle
# #Avoid Printing. Even within non-executing Block, it is affecting R Bookdown.
  aa <- 2147483647L
  #print(aa)
}

f_getPrimeUpto()

f_getPrimeUpto <- function(x){
  # #Get a Vector of Primes upto the given Number (Max. 524287)
  if(x < 2) {
    print("NOT ALLOWED!")
    return(NULL)
  } else if(x > 524287){
    print("Sadly, beyond this number it is difficult to generate the List of Primes!")
    return(NULL)
  }
  y <- 2:x
  i <- 1
  while (y[i] <= sqrt(x)) {
    y <-  y[y %% y[i] != 0 | y == y[i]]
    i <- i+1
  }
  return(y)
}

Benchmark

# #Compare any number of functions
result <- microbenchmark(
  sum(1:100)/length(1:100), 
  mean(1:100),
  #times = 1000,
  check = 'identical'
)
# #Print Table
print(result)
## Unit: microseconds
##                     expr min    lq    mean median     uq    max neval cld
## sum(1:100)/length(1:100) 1.2 1.301 1.54795 1.5005 1.6005  7.501   100  a 
##              mean(1:100) 5.9 6.001 6.56989 6.1010 6.2010 28.001   100   b
#
# #Boxplot of Benchmarking Result
#autoplot(result)
# #Above testcase showed a surprising result of sum()/length() being much faster than mean()
#
# #Or Compare Plot Rendering
if(FALSE) microbenchmark(print(jj), print(kk), print(ll), times = 2)

Sum-Mean

“ForLater” - Include rowsum(), rowSums(), colSums(), rowMeans(), colMeans() in this also.

# #Conclusion: use mean() because precision is difficult to achieve compared to speed
#
# #sum()/length() is faster than mean()
# #However, mean() does double pass, so it would be more accurate
# #mean.default() and var() compute means with an additional pass and so are more accurate
# #e.g. the variance of a constant vector is (almost) always zero 
# #and the mean of such a vector will be equal to the constant value to machine precision.
aa <- 1:100
#
microbenchmark(
  sum(aa)/length(aa), 
  mean(aa),
  mean.default(aa),
  .Internal(mean(aa)),
  #times = 1000,
  check = 'identical'
)
## Unit: nanoseconds
##                 expr  min   lq mean median   uq   max neval cld
##   sum(aa)/length(aa)  900 1000 1102   1000 1100  6100   100 a  
##             mean(aa) 5300 5500 5898   5600 5800 17900   100   c
##     mean.default(aa) 2800 3000 3306   3100 3200 22200   100  b 
##  .Internal(mean(aa)) 1000 1100 1241   1200 1200  7100   100 a
# #rnorm() generates random deviates of given length
set.seed(3)
aa <- rnorm(1e7)
str(aa)
##  num [1:10000000] -0.962 -0.293 0.259 -1.152 0.196 ...
#
# #NOTE manual calculation and mean() is NOT matching
identical(sum(aa)/length(aa), mean(aa))
## [1] FALSE
#
# #There is a slight difference
sum(aa)/length(aa) - mean(aa)
## [1] 2.355429e-17

Remove Objects

if(FALSE) {
  # #Remove all objects matching a pattern
  rm(list = ls(pattern = "f_"))
}

Options Memory

# #Check the Current Options Value
getOption("expressions")
## [1] 5000
if(FALSE) {
  # #Change Value
  # #NOTE it did not help when recursive function failed
  # #Error: node stack overflow
  # #Error during wrapup: node stack overflow
  # #Error: no more error handlers available ...
  options(expressions=10000)
}

Vectorize()

# #To Vectorise a Function
f_isPrimeV <- Vectorize(f_isPrime)

Compiling

# #To Pre-Compile a Function for faster performance
f_isPrimeC <- cmpfun(f_isPrime)

Profiling

# #To Profile a Function Calls for improvements
Rprof("file.out")
f_isPrime(2147483647L)
#f_getPrimesUpto(131071L)
Rprof(NULL)
summaryRprof("file.out")

Legacy A

# #Functions to check for PRIME - All of them have various problems
# #"-3L -2L -1L 0L 1L 8L" FALSE "2L 3L ... 524287L 2147483647L" TRUE
isPrime_a <- function(x) {
  # #Fails for "2147483647L" Error: cannot allocate vector of size 8.0 Gb
  if (x == 2L) {
    return(TRUE)
  } else if (any(x %% 2:(x-1) == 0)) {
    return(FALSE)
  } else return(TRUE)
}

isPrime_b <- function(x){
  # #Comparison of Division and Integer Division by 1,2,...,x
  # #Fails for "2147483647L" Error: cannot allocate vector of size 16.0 Gb
  # #Fails for "-ve and zero" Error: missing value where TRUE/FALSE needed
  # vapply(x, function(y) sum(y / 1:y == y %/% 1:y), integer(1L)) == 2L
  if(sum(x / 1:x == x %/% 1:x) == 2) {
    return(TRUE) 
  } else return(FALSE)
}

isPrime_c <- function(x) {
  # #RegEx Slowest: it converts -ve values and coerces non-integers, which may result in bugs
  x <- abs(as.integer(x))
  if(x > 8191L) {
    print("Do not run this with large values. RegEx is really slow.")
    stop()
  }
  !grepl('^1?$|^(11+?)\\1+$', strrep('1', x))
}

isPrime_d <- function(x) {
  # #Fails for "1" & returns TRUE
  # #Fails for "-ve and zero" Error: NA/NaN argument
  if(x == 2L || all(x %% 2L:max(2, floor(sqrt(x))) != 0)) {
    return(TRUE)
  } else return(FALSE)
}

isPrime_e <- function(x) {
  # #Fails for "-ve and zero" Error: NA/NaN argument
  # #This is the most robust which can be improved by conditional check for positive integers
  # #However, this checks the number against ALL integers up to ceiling(sqrt(x)), including non-primes
  if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
    # # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
    return(TRUE)
  } else {
    ## (any(x %% 2L:ceiling(sqrt(x)) == 0))
    ## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
    ## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
    return(FALSE)
  }
}

Legacy B

# #131071 (12,251th), 524287 (43,390th), 2147483647 (105,097,565th)
aa <- 1:131071
# #Following works but only till 524287L, Memory Overflow ERROR for 2147483647L
bb <- aa[f_isPrimeV(aa)]

getPrimeUpto_a <- function(x){
  # #Extremely Slow, can not go beyond 8191L in benchmark testing
  if(x < 2) return("ERROR")
  y <- 2:x
  primes <- rep(2L, x)
  j <- 1L
  for (i in y) {
    if (!any(i %% primes == 0)) {
      j <- j + 1L
      primes[j] <- i
      #cat(paste0("i=", i, ", j=", j, ", Primes= ",paste0(head(primes, j), collapse = ", ")))
    }
    #cat("\n")
  }
  result <- head(primes, j)
  #str(result)
  #cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
  #cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
  return(result)
}

getPrimeUpto_b <- function(x){
# #https://stackoverflow.com/questions/3789968/
  # #This is much faster even from the "aa[f_isPrimeV(aa)]"
    if(x < 2) return("ERROR")
    y <- 2:x
    i <- 1
    while (y[i] <= sqrt(x)) {
        y <-  y[y %% y[i] != 0 | y == y[i]]
        i <- i+1
    }
    result <- y
    #str(result)
    #cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
    #cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
    return(result)
}

getPrimeUpto_c <- function(x) {
  # #Problems and Slow
  # #Returns a Vector of Primes till the Number i.e. f_getPrimesUpto(7) = (2, 3, 5, 7)
  # #NOTE: f_getPrimesUpto(1) and f_getPrimesUpto(2) both return "2"
  if(!is.integer(x)) {
    cat("Error! Integer required. \n")
    stop()
  } else if(!identical(1L, length(x))) {
    cat("Error! Unit length vector required. \n")
    stop()
  } else if(x <= 0L) {
    cat("Error! Positive Integer required. \n")
    stop()
  } else if(x > 2147483647) {
    cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
    stop()
  }
  
  # #Can not create vector of length 2147483647L and also not needed that many
  # #ceiling(sqrt(7L)) return 3, however we need length 4 (2, 3, 5, 7)
  # #So, added extra 10
  #primes <- rep(NA_integer_, 10L + sqrt(2L))
  primes <- rep(2L, 10L + sqrt(2L))
  j <- 1L
  primes[j] <- 2L
  #
  i <- 2L
  while(i <= x) {
    # #na.omit() was the slowest step, so changed all NA to 2L in the primes
    #k <- na.omit(primes[primes <= ceiling(sqrt(i))])
    k <- primes[primes <= ceiling(sqrt(i))]
    if(all(as.logical(i %% k))) {
      j <- j + 1
      primes[j] <- i
    }  
    # #Increment with INTEGER Addition
    i = i + 1L
  }
  result <- primes[complete.cases(primes)]
  str(result)
  cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
  cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
  return(result)
}

getPrimeUpto_d <- function(n = 10L, i = 2L, primes = c(2L), bypass = TRUE){
  # #Using Recursion is NOT a good solution
  # #Function to return N Primes upto 1000 Primes (7919) or Max Value reaching 10000.
  if(i > 10000){
    cat("Reached 10000 \n")
    return(primes)
  }
  if(bypass) {
    maxN <- 1000L
    if(!is.integer(n)) {
      cat("Error! Integer required. \n")
      stop()
    } else if(!identical(1L, length(n))) {
      cat("Error! Unit length vector required. \n")
      stop()
    } else if(n <= 0L) {
      cat("Error! Positive Integer required. \n")
      stop()
    } else if(n > maxN) {
      cat(paste0("Error! This will calculate only upto ", maxN, " prime Numbers. \n"))
      stop()
    }
  }
  if(length(primes) < n) {
    if(all(as.logical(i %% primes[primes <= ceiling(sqrt(i))]))) {
      # #Coercing 0 to FALSE, Non-zero Values to TRUE
      # # "i %% 2L:ceiling(sqrt(i))" checks i against all integers till sqrt(i)
      # # "primes[primes <= ceiling(sqrt(i))]" checks i against only the primes till sqrt(i)
      # #However, the above needs hardcoded 2L as prime so the vector is never empty
      # #Current Number is Prime, so include it in the vector and check the successive one
      getPrimeUpto_d(n, i = i+1, primes = c(primes, i), bypass = FALSE)
    } else {
      # #Current Number is NOT Prime, so check the successive one
      getPrimeUpto_d(n, i = i+1, primes = primes, bypass = FALSE)
    }
  } else {
    # #Return the vector when it reaches the count
    return(primes)
  }
}

8.5 Measures of Location

8.5.1 Mean

Definition 8.6 Given a data set \({X=\{x_1,x_2,\ldots,x_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({x_1,x_2,\ldots,x_n}\) divided by the count \({n}\).
  • Refer equation (8.6)
    • Sample mean is denoted by \({\overline{x}}\) (x bar) and Population mean is denoted by \({\mu}\).
    • Mean is the most commonly used measure of central location, even though it is influenced by extreme values.

\[\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right ) = \frac{x_1+x_2+\cdots +x_n}{n} \tag{8.6}\]

In the mean calculation, normally each \({x_i}\) is given equal importance or weightage of \({1/n}\). However, in some instances the mean is computed by giving each observation a weight that reflects its relative importance. A mean computed in this manner is referred to as the weighted mean, as given in equation (8.7)

\[\bar{x} = \frac{\sum_{i=1}^n{w_ix_i}}{\sum_{i=1}^n{w_i}} \tag{8.7}\]

Caution: Unit of mean is same as unit of the variable e.g. cost_per_kg thus ‘w’ would be ‘kg.’

Mean

aa <- 1:10
# #Mean of First 10 Numbers
mean(aa)
## [1] 5.5

More

aa <- 1:10
# #Mean of First 10 Numbers
ii <- mean(aa)
print(ii)
## [1] 5.5
jj <- sum(aa)/length(aa)
stopifnot(identical(ii, jj))
#
# #Mean of First 10 Prime Numbers (is neither Prime nor Integer)
mean(f_getRDS(xxPrimes)[1:10])
## [1] 12.9
#
# #Mean of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% mean()
## [1] 4.71

Weighted Mean

aa <- tibble(Purchase = 1:5, cost_per_kg = c(3, 3.4, 2.8, 2.9, 3.25), 
             kg = c(1200, 500, 2750, 1000, 800))
# #NOTE that unit of mean is same as unit of the variable e.g. cost_per_kg thus 'w' would be 'kg'
(ii <- sum(aa$cost_per_kg * aa$kg)/sum(aa$kg))
## [1] 2.96
jj <- with(aa, sum(cost_per_kg * kg)/sum(kg))
kk <- weighted.mean(x = aa$cost_per_kg, w = aa$kg)
stopifnot(all(identical(ii, jj), identical(ii, kk)))

8.5.2 Median

Definition 8.7 Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.
  • Refer equation (8.8)
    • The median is the value in the middle when the data is sorted
    • For an odd number of observations, the median is the middle value.
    • For an even number of observations, the median is the average of the two middle values.
    • Although the mean is the more commonly used measure of central location, whenever a data set contains extreme values, the median is preferred.
      • The mean and median are different concepts and answer different questions.
        • Ex: Income - nearly always reported as median, but if we are looking at the ‘spending power of the whole community’ it may not be the right measure.
    • The median is well-defined for any ordered data, and is independent of any distance metric.
      • The median can thus be applied to classes which are ranked but not numerical (ordinal), although the result might be halfway between classes if there is an even number of cases.

\[\begin{align} \text{if n is odd, } median(x) & = x_{(n+1)/2} \\ \text{if n is even, } median(x) & = \frac{x_{(n/2)}+x_{(n/2)+1}}{2} \end{align} \tag{8.8}\]
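
Both branches of equation (8.8) can be demonstrated directly; the vectors below are arbitrary:

```r
# #Odd n: the median is the middle value x_((n+1)/2)
aa <- 1:9
stopifnot(median(aa) == aa[(length(aa) + 1) / 2])   # 5
#
# #Even n: the median is the average of the two middle values
bb <- 1:10
stopifnot(median(bb) == mean(bb[c(length(bb) / 2, length(bb) / 2 + 1)]))  # 5.5
```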

Median

aa <- 1:10 
# #Median of First 10 Numbers
median(aa)
## [1] 5.5

More

aa <- 1:10 
# #Median of First 10 Numbers
median(aa)
## [1] 5.5
#
# #Median of First 10 Prime Numbers (is NOT prime)
median(f_getRDS(xxPrimes)[1:10])
## [1] 12
#
# #Median of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% median()
## [1] 4.5

8.5.3 Geometric Mean

Definition 8.8 The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.
  • Refer equation (8.9)
    • The geometric mean applies only to positive numbers
    • The geometric mean is often used for a set of numbers whose values are meant to be multiplied together or are exponential in nature
    • For all positive data sets containing at least one pair of unequal values, the harmonic mean is always the least of the three means, while the arithmetic mean is always the greatest of the three and the geometric mean is always in between.

\[\overline{x}_g = \left(\prod _{i=1}^{n} x_i\right)^{\frac{1}{n}} = \sqrt[{n}]{x_1 x_2 \ldots x_n} \tag{8.9}\]
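
The ordering claimed above (harmonic ≤ geometric ≤ arithmetic) can be verified on any positive vector with unequal values; the vector here is arbitrary:

```r
# #Harmonic, Geometric and Arithmetic means of a positive vector
aa <- c(1, 2, 4, 8)
hm <- 1 / mean(1 / aa)       # Harmonic mean
gm <- exp(mean(log(aa)))     # Geometric mean
am <- mean(aa)               # Arithmetic mean
c(HM = hm, GM = gm, AM = am)
stopifnot(hm < gm, gm < am)
```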

Geometric Mean

aa <- 1:10
# #Geometric Mean of First 10 Numbers
exp(mean(log(aa)))
## [1] 4.528729

More

aa <- 1:10
# #Geometric Mean of First 10 Numbers
ii <- exp(mean(log(aa)))
jj <- prod(aa)^(1/length(aa))
stopifnot(identical(ii, jj))
#
# #Geometric Mean of First 10 Prime Numbers 
exp(mean(log(f_getRDS(xxPrimes)[1:10])))
## [1] 9.573889

8.5.4 Mode

Definition 8.9 The mode is the value that occurs with greatest frequency.
  • The median makes sense when there is a linear order on the possible values. Unlike median, the concept of mode makes sense for any random variable assuming values from a vector space.

Mode

# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ] %>% pull(VAL)
f_getMode(bb)
## [1] 9

More

# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #Get Frequency
bb %>% count(VAL)
## # A tibble: 10 x 2
##      VAL     n
##    <int> <int>
##  1     0     8
##  2     1     8
##  3     2    12
##  4     3    12
##  5     4    10
##  6     5     8
##  7     6     9
##  8     7     8
##  9     8    12
## 10     9    13
#
# #Get Mode
bb %>% pull(VAL) %>% f_getMode()
## [1] 9

f_getMode()

f_getMode <- function(x) {
  # #Calculate Statistical Mode
  # #NOTE: Single Length, All NA, Characters etc. have NOT been validated
  # #https://stackoverflow.com/questions/56552709
  # #https://stackoverflow.com/questions/2547402
  # #Remove NA
  if (anyNA(x)) {
    x <- x[!is.na(x)]
  }
  # #Get Unique Values
  ux <- unique(x)
  # #Match
  ux[which.max(tabulate(match(x, ux)))]
}

8.5.5 Percentiles

Definition 8.10 A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 – p)% of the observations are greater than the \(p^{th}\) percentile.
  • Refer equation (8.10)
    • Percentile is the value which divides the data into two groups when it is sorted
    • Quartiles are specific percentiles of 25%, 50% and 75%
    • Median is 50% percentile
    • Caution: Excel “PERCENTILE.EXC” calculations match the type = 6 option of quantile(); the default is type = 7

\[L_p = \frac{p}{100}(n+1) \tag{8.10}\]
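
The location formula (8.10) is the positioning rule that quantile() uses with type = 6; a sketch on a simple sorted vector (the vector and p are arbitrary):

```r
# #Position of the p-th percentile by equation (8.10)
aa <- 1:10
p  <- 25
Lp <- p / 100 * (length(aa) + 1)    # 2.75: between the 2nd and 3rd values
ii <- floor(Lp)
# #Interpolate between the two neighbouring sorted values
manual <- aa[ii] + (Lp - ii) * (aa[ii + 1] - aa[ii])
manual
## [1] 2.75
stopifnot(manual == unname(quantile(aa, p / 100, type = 6)))
```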

Percentiles

# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
quantile(bb$VAL, 0.5)
## 50% 
## 4.5

More

# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
ii <- quantile(bb$VAL, 0.5)
print(ii)
## 50% 
## 4.5
jj <- median(bb$VAL)
stopifnot(identical(unname(ii), jj))
# 
# #All Quartiles
quantile(bb$VAL, seq(0, 1, 0.25))
##   0%  25%  50%  75% 100% 
## 0.00 2.00 4.50 7.25 9.00
# #summary()
summary(bb$VAL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    4.50    4.71    7.25    9.00
#
# #To Match with Excel "PERCENTILE.EXC" use type=6 in place of default type=7
quantile(bb$VAL, seq(0, 1, 0.25), type = 6)
##   0%  25%  50%  75% 100% 
## 0.00 2.00 4.50 7.75 9.00

8.6 Measures of Variability

In addition to measures of location, it is often desirable to consider measures of variability, or dispersion.

  • Range range()
    • (Largest value - Smallest value) i.e. max() - min()
    • Range is based on only two of the observations and thus is highly influenced by extreme values.
  • Interquartile Range (IQR) IQR()
    • The difference between the third quartile and the first quartile
    • It overcomes the dependency on extreme values
  • Mean Absolute Deviation (MAD; the same formula applied to prediction errors is called the Mean Absolute Error, MAE)
    • \(MAD = \frac{\sum |x_i - \overline{x}|}{n}\)
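
A minimal sketch of the three measures above on a small hard-coded sample (the first ten digits of PI, hard-coded here so the chunk does not depend on f_getRDS(xxPI)):

```r
# #Small hard-coded sample: first ten digits of PI
x <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
# #Range = max - min
rng <- diff(range(x))            # same as max(x) - min(x)
# #IQR = Q3 - Q1 (quantile() default type = 7)
iqr <- IQR(x)
# #Mean absolute deviation about the mean
mad_mean <- mean(abs(x - mean(x)))
c(range = rng, IQR = iqr, MAD = mad_mean)
```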

8.6.1 Variance

Definition 8.11 The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.
  • Refer equation (8.11)
    • Sample Variance is denoted by \(s^2\) and Population Variance is denoted by \(\sigma^2\)
    • The variance is a measure of variability that utilizes all the data.
    • The difference between each \({x_i}\) and the mean (\(\overline{x}, \mu\)) is called a deviation about the mean i.e. (\(x_i - \overline{x}\)). Sum of deviation about the mean is always zero i.e. \(\sum (x_i - \overline{x}) =0\)
    • In the computation of the variance, the deviations about the mean are squared.

\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align} \tag{8.11}\]
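
Equation (8.11) can be checked against the built-in var(), which uses the sample (n − 1) denominator; a minimal sketch on a small made-up sample:

```r
# #Equation (8.11): sample variance divides by (n - 1),
# #population variance divides by n
x <- c(46, 54, 42, 46, 32)
n <- length(x)
s2 <- sum((x - mean(x))^2) / (n - 1)   # sample variance, matches var()
stopifnot(isTRUE(all.equal(s2, var(x))))
sigma2 <- sum((x - mean(x))^2) / n     # population variance
c(sample = s2, population = sigma2)    # 64 and 51.2
```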

8.6.2 Standard Deviation

Definition 8.12 The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
  • Refer equation (8.12)
    • Standard deviation for sample is denoted by \({s}\) and for Population by \({\sigma}\)
    • The coefficient of variation is a relative measure of variability. It measures the standard deviation relative to the mean. It is given in percentage as \(100 \times \sigma / \mu\)

\[\begin{align} \sigma &= \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{N-1} \sum_{i=1}^N \left(x_i - \bar{x}\right)^2} \end{align} \tag{8.12}\]
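
A minimal sketch of the coefficient of variation on the same small made-up sample as above:

```r
# #Coefficient of variation: SD relative to the mean, in percent
x <- c(46, 54, 42, 46, 32)
cv <- 100 * sd(x) / mean(x)
round(cv, 2)
## [1] 18.18
```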

8.7 Measures of Distribution Shape

8.7.1 Skewness

Definition 8.13 Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Definition 8.14 A tail refers to the tapering sides at either end of a distribution curve.
  • Data skewed to the left result in negative skewness; a symmetric data distribution results in zero skewness; and data skewed to the right result in positive skewness.
  • \(\tilde{\mu}_{3}\) is the \(3^{rd}\) standardized moment
    • Side topic: A standardized moment of a probability distribution is a moment (normally a higher degree central moment) that is normalized. The normalization is typically a division by an expression of the standard deviation which renders the moment scale invariant.
      • \(\tilde{\mu}_{1} = 0\), because the first moment about the mean is always zero.
      • \(\tilde{\mu}_{2} = 1\), because the second moment about the mean is equal to the variance \({\sigma}^2\).
      • \(\tilde{\mu}_{3}\) is a measure of skewness
      • \(\tilde{\mu}_{4}\) refers to the Kurtosis

Refer figure 8.1

  • No skew: (symmetric)
    • A unimodal distribution with zero skewness is not necessarily symmetric; however, a symmetric unimodal or multimodal distribution always has zero skewness. The normal distribution has a skewness of zero, but the reverse need not be true.
  • Negative skew: (left-skewed, left-tailed, or skewed to the left)
    • The left tail is longer, thus the ‘left’ refers to the left tail being drawn out
    • The curve itself appears to be leaning to the right i.e. the mass of the distribution is concentrated on the right of the figure
  • Positive skew: (right-skewed, right-tailed, or skewed to the right)
    • The right tail is longer, thus the ‘right’ refers to the right tail being drawn out
    • The curve itself appears to be leaning to the left i.e. the mass of the distribution is concentrated on the left of the figure
  • Relationship of mean and median
    • The skewness is not directly related to the relationship between the mean and median: a distribution with negative skew can have its mean greater than or less than the median, and likewise for positive skew
    • However, generally the skew can be calculated as \(({\mu} -{\nu})/\sigma\), where \({\nu}\) is median
  • Application:
    • Skewness indicates the direction and relative magnitude of deviation from the normal distribution.
    • It indicates the direction of outliers
    • With pronounced skewness, standard statistical inference procedures such as a confidence interval for a mean will be not only incorrect, in the sense that the true coverage level will differ from the nominal (e.g., 95%) level, but they will also result in unequal error probabilities on each side.

Skewness is given by equation (8.13), which is shown here because it has deep meaning

\[Skew = \frac{\tfrac {1}{n}\sum_{i=1}^{n}(x_{i}-{\overline {x}})^{3}}{\left[\tfrac {1}{n-1}\sum_{i=1}^{n}(x_{i}-{\overline {x}})^{2} \right]^{3/2}} \tag{8.13}\]

Charts


Figure 8.1 (Left Tail, Negative) Beta, Normal Distribution, Exponential (Positive, Right Tail)

skewness()

# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
even_skew <- c(49, 50, 51)
pos_skew <- c(even_skew, 60)
neg_skew <- c(even_skew, 40)
skew_lst <- list(even_skew, pos_skew, neg_skew)
# #Mean, Median, SD
cat(paste0("Mean (even, pos, neg): ", 
           paste0(vapply(skew_lst, mean, numeric(1)), collapse = ", "), "\n"))
## Mean (even, pos, neg): 50, 52.5, 47.5
cat(paste0("Median (even, pos, neg): ", 
           paste0(vapply(skew_lst, median, numeric(1)), collapse = ", "), "\n"))
## Median (even, pos, neg): 50, 50.5, 49.5
cat(paste0("SD (even, pos, neg): ", paste0(
           round(vapply(skew_lst, sd, numeric(1)), 1), collapse = ", "), "\n"))
## SD (even, pos, neg): 1, 5.1, 5.1
#
cat(paste0("Skewness (even, pos, neg): ", paste0(
           round(vapply(skew_lst, e1071::skewness, numeric(1)), 1), collapse = ", "), "\n"))
## Skewness (even, pos, neg): 0, 0.7, -0.7
cat(paste0("Kurtosis (even, pos, neg): ", paste0(
           round(vapply(skew_lst, e1071::kurtosis, numeric(1)), 1), collapse = ", "), "\n"))
## Kurtosis (even, pos, neg): -2.3, -1.7, -1.7

Normal Exp Beta

# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Skewness: Normal has value close to 0
# #Skewness "e1071" has Type = 3 as default. Its Type = 1 matches "moments"
# #Practically, Normal has (small) NON-Zero Positive Skewness
skew_e_t3 <- vapply(dis_lst, e1071::skewness, numeric(1))
skew_e_t2 <- vapply(dis_lst, e1071::skewness, type = 2, numeric(1))
skew_e_t1 <- vapply(dis_lst, e1071::skewness, type = 1, numeric(1))
skew_mmt <-  vapply(dis_lst, moments::skewness, numeric(1))
stopifnot(identical(round(skew_e_t1, 10), round(skew_mmt, 10)))
cat(paste0("e1071: Type = 1 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Skewness (Normal, Exp, Beta): 0.0407, 2.0573, -0.6279
cat(paste0("e1071: Type = 2 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Skewness (Normal, Exp, Beta): 0.0407, 2.0576, -0.628
cat(paste0("e1071: Type = 3 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Skewness (Normal, Exp, Beta): 0.0407, 2.057, -0.6278
#
# #Formula: (sigma_ (x_i - mu)^3) /(n * sd^3)
bb <- xxNormal
skew_man <- sum({bb - mean(bb)}^3) / {length(bb) * sd(bb)^3}
cat(paste0("(Manual) Skewness of Normal: ", round(skew_man, 4), 
           " (vs. e1071 Type 3 = ", round(skew_e_t3[1], 4), ") \n"))
## (Manual) Skewness of Normal: 0.0407 (vs. e1071 Type 3 = 0.0407)

Distributions

set.seed(3)
nn <- 10000L
# #Normal distribution is symmetrical
xxNormal <- rnorm(n = nn, mean = 0, sd = 1)
# #The exponential distribution is positive skew
xxExp <- rexp(n = nn, rate = 1)
# #The beta distribution with hyper-parameters α=5 and β=2 is negative skew
xxBeta <- rbeta(n = nn, shape1 = 5, shape2 = 2)
#
# #Save
f_setRDS(xxNormal)
f_setRDS(xxExp)
f_setRDS(xxBeta)
#f_getRDS(xxNormal)
# #Get the Distributions
xxNormal <- f_getRDS(xxNormal)
xxExp <- f_getRDS(xxExp)
xxBeta <- f_getRDS(xxBeta)

Density

# #Density Curve
# #Assumes 'hh' has data in 'ee'. In: caption_hh
#Basics
mean_hh <- mean(hh$ee)
sd_hh <- sd(hh$ee)
#
skew_hh <- skewness(hh$ee)
kurt_hh <- kurtosis(hh$ee)
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_density(alpha = 0.2, colour = "#21908CFF") + 
  geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), 
        axis.ticks = element_blank(), 
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
        axis.line.y = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank()) + 
  labs(x = "x", y = "Density", 
       subtitle = paste0("Mean = ", round(mean_hh, 3), "; SD = ", round(sd_hh, 3), "; Skewness = ", round(skew_hh, 3), "; Kurtosis = ", round(kurt_hh, 3)), 
        caption = caption_hh, title = title_hh)
}
assign(caption_hh, C03)
rm(C03)

8.7.2 Kurtosis

Definition 8.15 Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal {N}}_{(\mu,\, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).

Distributions with zero excess kurtosis are called mesokurtic. The most prominent example of a mesokurtic distribution is the normal distribution. The kurtosis of any univariate normal distribution is 3.

Distributions with kurtosis less than 3 are said to be platykurtic. It means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers.

Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian, and therefore produces more outliers than the normal distribution.

Kurtosis is the average (or expected value) of the standardized data raised to the fourth power. Any standardized values that are less than 1 (i.e., data within one standard deviation of the mean, where the “peak” would be), contribute virtually nothing to kurtosis, since raising a number that is less than 1 to the fourth power makes it closer to zero. The only data values that contribute to kurtosis in any meaningful way are those outside the region of the peak; i.e., the outliers. Therefore, kurtosis measures outliers only; it measures nothing about the “peak.”

The sample kurtosis is a useful measure of whether there is a problem with outliers in a data set. Larger kurtosis indicates a more serious outlier problem.

# #Kurtosis Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Kurtosis: Normal has value close to 3 Kurtosis (=0 excess Kurtosis)
# #Kurtosis "e1071" has Type = 3 as default. Its Type = 1 matches "moments" with difference of 3
kurt_e_t3 <- vapply(dis_lst, e1071::kurtosis, numeric(1))
kurt_e_t2 <- vapply(dis_lst, e1071::kurtosis, type = 2, numeric(1))
kurt_e_t1 <- vapply(dis_lst, e1071::kurtosis, type = 1, numeric(1))
kurt_mmt <-  vapply(dis_lst, moments::kurtosis, numeric(1))
stopifnot(identical(round(kurt_e_t1, 10), round(kurt_mmt - 3, 10)))
cat(paste0("e1071: Type = 1 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Kurtosis (Normal, Exp, Beta): -0.0687, 6.3223, -0.106
cat(paste0("e1071: Type = 2 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Kurtosis (Normal, Exp, Beta): -0.0682, 6.326, -0.1055
cat(paste0("e1071: Type = 3 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Kurtosis (Normal, Exp, Beta): -0.0693, 6.3204, -0.1066
#
# #Formula: (sigma_ (x_i - mu)^4) /(n * sd^4)
bb <- xxNormal
kurt_man <- {sum({bb - mean(bb)}^4) / {length(bb) * sd(bb)^4}} - 3
cat(paste0("(Manual) Kurtosis of Normal: ", round(kurt_man, 4), 
           " (vs. e1071 Type 3 = ", round(kurt_e_t3[1], 4), ") \n"))
## (Manual) Kurtosis of Normal: -0.0693 (vs. e1071 Type 3 = -0.0693)

8.8 Relative Location

8.8.1 z-Scores

Measures of relative location help us determine how far a particular value is from the mean. By using both the mean and standard deviation, we can determine the relative location of any observation.

Definition 8.16 A sample of \({n}\) observations given by \({X=\{x_1,x_2,\ldots,x_n\}}\) have a sample mean \({\overline{x}}\) and the sample standard deviation, \({s}\).
Definition 8.17 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
  • Refer equation (8.14) (Similar to equation (11.4))
    • For example, \(z_1 = 1.2\) would indicate that \({x_1}\) is 1.2 standard deviations greater than the sample mean. Similarly, \(z_2 = -0.5\) would indicate that \({x_2}\) is 0.5 standard deviation less than the sample mean.
    • A z-score greater than zero occurs for observations with a value greater than the mean, and a z-score less than zero occurs for observations with a value less than the mean.
    • A z-score of zero indicates that the value of the observation is equal to the mean.
    • The z-score for any observation can be interpreted as a measure of the relative location of the observation in a data set.
    • The process of converting a value for a variable to a z-score is often referred to as a z transformation or scaling.

\[z_i = \frac{x_i - \overline{x}}{s} \tag{8.14}\]

NOTE: The “Z statistic” is a special case of the “Z score”. The Z score (i.e. standard score) can always be computed (the general case) whenever there is a mean and a standard deviation; it translates X into a variable Z with zero mean and unit variance. (It “imposes normality” in form only; it does not make non-normal data normal!) The “Z statistic” standardizes in the same way, in the special case where the quantity being standardized is the sample mean: per the CLT, its standard deviation is the ‘standard error of the sample mean’, \(\sigma/\sqrt{n}\), rather than (e.g.) a known population standard deviation or even just the sample standard deviation.
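
A minimal sketch of this distinction, with made-up parameters (mu, sigma, and n here are illustrative assumptions, not values used elsewhere in this document):

```r
# #z-score vs. Z statistic: the z-score standardizes a single
# #observation; the Z statistic standardizes the sample mean,
# #whose standard error is sigma / sqrt(n)
set.seed(1)
mu <- 100; sigma <- 15; n <- 36
x <- rnorm(n, mean = mu, sd = sigma)
# #z-score of one observation
z_obs <- (x[1] - mu) / sigma
# #Z statistic of the sample mean
z_stat <- (mean(x) - mu) / (sigma / sqrt(n))
c(z_obs = z_obs, z_stat = z_stat)
```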

Definition 8.18 Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.

Caution:

  • Scaling does influence the interpretation of the parameters in many statistical analyses (regression, PCA, etc.), so the decision to scale should be based on how you want to interpret your parameters.
    • Although the shape of a distribution is unchanged by scaling, the distribution itself is changed.
    • Ex: After scaling, a Poisson-distributed variable would no longer follow a Poisson distribution.
    • However, because the shape is preserved, scaling will not influence (positively or negatively) the violations of model assumptions.
xxflights <- f_getRDS(xxflights)
bb <- na.omit(xxflights$air_time)
# Scaling
ii <- {bb - mean(bb)} / sd(bb)
str(ii)
##  num [1:327346] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
##  - attr(*, "na.action")= 'omit' int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
# #scale() gives a Matrix with original mean and sd as its attribute
jj <- scale(bb)
str(jj)
##  num [1:327346, 1] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
##  - attr(*, "scaled:center")= num 151
##  - attr(*, "scaled:scale")= num 93.7
stopifnot(identical(as.vector(ii), as.vector(jj)))
#
hh <- tibble(ee = as.vector(jj))
title_hh <- "Flights: Air Time (Scaled)"
caption_hh <- "C03P08" #iiii

Image


Figure 8.2 Before and After Scaling

Histogram

# #hh$ee title_hh caption_hh
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
           label = TeX(r'($\bar{x}$)', output = "character"), 
           color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) +
  coord_cartesian(ylim = c(0, 35000)) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(Mean= ", round(mean(.[[1]]), 3), 
                         "; SD= ", round(sd(.[[1]]), 3),
                         ")"), 
      caption = caption_hh, title = title_hh)
}
assign(caption_hh, C03)
rm(C03)

Annotate

if(FALSE){
# #check_overlap = TRUE works for de-blurring. However, it still checks each point thus slow
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"), 
              x = mean(.data[["ee"]]), y = -Inf),
          color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) 
# #Create your own dataset
geom_text(data = tibble(x = mean(.[[1]]), y = -Inf, 
                        label = TeX(r'($\bar{x}$)', output = "character")), 
          aes(x = x, y = y, label = label), 
          color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE ) 
# #Or Equivalent
ggplot2::annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
                  label = TeX(r'($\bar{x}$)', output = "character"), 
                  color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) 
#
ggpp::annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
               label = TeX(r'($\bar{x}$)', output = "character"), 
               color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) 
}

Colours

# #List All Colour Names in R
str(colors())
##  chr [1:657] "white" "aliceblue" "antiquewhite" "antiquewhite1" "antiquewhite2" "antiquewhite3" ...
# #Packages: viridis, scales, viridisLite
# #Show N Colours with Max. Contrast
q_colors <- 5
# #Display Colours
if(FALSE) show_col(viridis_pal()(q_colors))
# #Get the Viridis i.e. "D" palette Hex Values for N Colours
v_colors <-  viridis(q_colors, option = "D")
v_colors
## [1] "#440154FF" "#3B528BFF" "#21908CFF" "#5DC863FF" "#FDE725FF"

8.8.2 Chebyshev Theorem

Definition 8.19 Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).
  • Refer to Definition 8.19
    • Chebyshev Theorem: At least \((1-1/z^2)\) of the data values must be within z standard deviations of the mean, where z is any value greater than 1.
      • Thus, at least 75% of the data values must be within \(\overline{x} \pm 2s\), at least 89% within \(\overline{x} \pm 3s\), and at least 94% within \(\overline{x} \pm 4s\).
    • Chebyshev theorem can be applied to any data set regardless of the shape of the distribution of the data.
    • Ex: Test scores of 100 students have \((\mu = 70, \sigma = 5)\)
    • How many students had test scores between 60 and 80
      • From equation (8.14), \(z_{60} = \frac{60 - 70}{5} = -2\)
      • Similarly, \(z_{80} = \frac{80 - 70}{5} = +2\)
      • According to theorem 8.19, values that must be within \({z}\) standard deviation are
        • \({(1-1/z^2) = (1 - 1/2^2) = 0.75 = 75\%}\)
        • i.e. at least 75 students must have test scores between 60 and 80
    • How many students had test scores between 58 and 82
      • \(z_{58} = -2.4, z_{82} = +2.4\)
      • \({(1 - 1/2.4^2) \approx 0.826 \approx 83\%}\)
        • i.e. at least 83 students must have test scores between 58 and 82
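
The bound above can be wrapped in a small helper that reproduces both worked examples (f_chebyshev below is a hypothetical name following this document's f_ convention, not a function defined elsewhere here):

```r
# #Chebyshev bound: at least (1 - 1/z^2) of values lie within
# #z standard deviations of the mean, for z > 1
f_chebyshev <- function(z) 1 - 1 / z^2
round(f_chebyshev(c(2, 2.4, 3, 4)), 3)
## [1] 0.750 0.826 0.889 0.938
```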

8.8.3 Empirical Rule

Definition 8.20 Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are Pr(x) 68.27%, 95.45%, and 99.73%.
  • According to the empirical rule, for a Normal distribution
    • \(Pr({\mu} - 1{\sigma} \leq {X} \leq {\mu} + 1{\sigma}) \approx 68.27\%\)
    • \(Pr({\mu} - 2{\sigma} \leq {X} \leq {\mu} + 2{\sigma}) \approx 95.45\%\) i.e. most of the data values
    • \(Pr({\mu} - 3{\sigma} \leq {X} \leq {\mu} + 3{\sigma}) \approx 99.73\%\) i.e. almost all data values
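
These three probabilities follow directly from the standard normal CDF; a minimal check with pnorm():

```r
# #P(mu - k*sigma <= X <= mu + k*sigma) for a standard normal
k <- 1:3
pr <- pnorm(k) - pnorm(-k)
round(100 * pr, 2)
## [1] 68.27 95.45 99.73
```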

8.9 Outliers

Definition 8.21 Sometimes unusually large or unusually small values are called outliers. It is a data point that differs significantly from other observations.
  • Reasons
    • Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations.
    • In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. However, a small number of outliers is to be expected (and not due to any anomalous condition).
    • Estimators capable of coping with outliers are said to be robust: the median is a robust statistic of central tendency, while the mean is not. However, the mean is generally a more precise estimator.
  • Keeping vs. Removing Outliers
    • A data value that has been incorrectly recorded or included should be removed
    • An unusual data value that has been recorded correctly and belongs in the data set should be kept
  • Standardized values (z-scores) can be used to identify outliers.
    • Empirical Rule allows us to conclude that for normal distribution, almost all the data values will be within three standard deviations of the mean \((\overline{x} \pm 3s)\).
    • Hence, in using z-scores to identify outliers, we recommend treating any data value with a z-score less than −3 or greater than +3 as an outlier.
    • Such data values can then be reviewed for accuracy and to determine whether they belong in the data set.
    • In the case of normally distributed data, the three sigma rule can be used to identify outliers.
      • In a sample of 1000 observations, the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected. If the sample size is only 100, however, just three such outliers are already reason for concern.
  • Another approach to identifying outliers is based upon IQR
    • \(\text{Lower Limit} = Q_1 - 1.5 \space \text{IQR}\) and \(\text{Upper Limit} = Q_3 + 1.5 \space \text{IQR}\)
    • An observation is classified as an outlier if its value is less than the lower limit or greater than the upper limit.
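
A minimal sketch of the IQR fences on a small made-up sample with one injected extreme value:

```r
# #IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR (default type = 7 quantiles)
x <- c(10, 12, 12, 13, 12, 11, 14, 13, 15, 100)
q <- quantile(x, c(0.25, 0.75))            # Q1, Q3
fence <- unname(q + c(-1.5, 1.5) * IQR(x)) # lower and upper limits
fence
x[x < fence[1] | x > fence[2]]             # values flagged as outliers
## [1] 100
```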

8.10 Summary

Five-Number Summary is used to quickly summarise a dataset. i.e. Min, Q1, Median, Q3, Max

  • A boxplot is a graphical display of data based on a five-number summary.
    • By using the interquartile range, IQR = Q3 − Q1, limits are located at 1.5(IQR) below Q1 and 1.5(IQR) above Q3
    • The whiskers are drawn from the ends of the box to the smallest and largest values inside the limits
    • Boxplots can also be used to provide a graphical summary of two or more groups and facilitate visual comparisons among the groups.
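
The five-number summary itself is available via fivenum(); a minimal sketch on a small made-up sample (note fivenum() uses Tukey's hinges, which can differ slightly from quantile()'s quartiles):

```r
# #Five-Number Summary: Min, Q1 (lower hinge), Median, Q3 (upper hinge), Max
x <- c(10, 12, 12, 13, 12, 11, 14, 13, 15, 100)
fivenum(x)
## [1]  10.0  12.0  12.5  14.0 100.0
```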

BoxPlot

geom_boxplot()

Figure 8.3 geom_boxplot()

Code

# #nycflights13::weather
bb <- weather
# #NA are present in the data
summary(bb$temp)
#
# #BoxPlot
C03P01 <- bb %>% drop_na(temp) %>% mutate(month = factor(month, ordered = TRUE)) %>% {
    ggplot(data = ., mapping = aes(x = month, y = temp)) +
    #geom_violin() +
    geom_boxplot(aes(fill = month), outlier.colour = 'red', notch = TRUE) +
    stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") + 
    scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
    #geom_point() +
    #geom_jitter(position=position_jitter(0.2)) +
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = "Months", y = "Temperature", subtitle = "With Mean & Notch", 
         caption = "C03P01", title = "BoxPlot")
}

8.11 Relationship between Two Variables

8.11.1 Covariance

Definition 8.22 Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
  • Refer equation (8.15)
    • For a sample of size \({n}\) with the observations \((x_1, y_1), (x_2, y_2)\), and so on, the covariance is given by equation (8.15)
    • A positive value for \(s_{xy}\) indicates a positive linear association between x and y; that is, as the value of x increases, the value of y increases. Similarly a negative value shows a negative linear association.
      • In the example, \(s_{xy} = 11\)
    • If the points are evenly distributed in the scatterplot, the value of \(s_{xy}\) will be close to zero, indicating no linear association between x and y.
    • Caution: Problem with using covariance as a measure of the strength of the linear relationship is that the value of the covariance depends on the units of measurement for x and y.

\[\begin{align} \sigma_{xy} &= \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{n} \\ s_{xy} &= \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{n-1} \end{align} \tag{8.15}\]
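
Equation (8.15) can be checked against the built-in cov(); a minimal sketch using the first five (Commercials, Sales) pairs of Table 8.1:

```r
# #Sample covariance by the formula vs. cov()
x <- c(2, 5, 1, 3, 4)          # Commercials, weeks 1-5 of Table 8.1
y <- c(50, 57, 41, 54, 54)     # Sales, weeks 1-5 of Table 8.1
sxy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
stopifnot(isTRUE(all.equal(sxy, cov(x, y))))
sxy
## [1] 9
```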


Figure 8.4 Scatter Plot Quadrants for Covariance

Covariance

# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>% 
  mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY) 
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11

Code

bb <- f_getRDS(xxCommercials) 

# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot, Trendline Equation, R2, mean x & y
C03P02 <- bb %>% {
  ggplot(data = ., aes(x = Commercials, y = Sales)) + 
  geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
  stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")), 
               formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = "~italic(x)", parse = TRUE) +
  geom_vline(aes(xintercept = round(mean(Commercials), 3)), color = 'red', linetype = "dashed") +
  geom_hline(aes(yintercept = round(mean(Sales), 3)), color = 'red', linetype = "dashed") +
  geom_text(aes(label = TeX(r"($\bar{x} = 3$)", output = "character"), 
                x = round(mean(Commercials), 3), y = -Inf), 
            color = 'red', hjust = -0.2, vjust = -0.5, parse = TRUE, check_overlap = TRUE) + 
  geom_text(aes(label = TeX(r"($\bar{y} = 51$)", output = "character"), 
                x = Inf, y = round(mean(Sales), 3)), 
            color = 'red', hjust = 1.5, vjust = -0.5, parse = TRUE, check_overlap = TRUE) + 
  geom_point() +
  k_gglayer_scatter +
  labs(x = "Commercials", y = "Sales ($100s)",
       subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"), 
       caption = "C03P02", title = "Scatter Plot")
}

More Text

  • Unlike Pearson correlation, covariance itself is not a measure of the magnitude of the linear relationship; it is a measure of co-variation (which could be merely monotonic). This is because covariance depends not only on the strength of the linear association but also on the magnitudes of the variances.
  • More details are in the following links

8.11.2 Correlation Coefficient

Definition 8.23 Correlation coefficient is a measure of linear association between two variables that takes on values between −1 and +1. Values near +1 indicate a strong positive linear relationship; values near −1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
  • Refer equation (8.16) & Table 8.1
    • The ‘Pearson Product Moment Correlation Coefficient’ or sample correlation coefficient is computed by dividing the sample covariance \(s_{xy}\) by the product of the sample standard deviation of x (\(s_{x}\)) and the sample standard deviation of y (\(s_{y}\)).
      • Values close to −1 (negative) or +1 (positive) indicate a strong linear relationship. The closer the correlation is to zero, the weaker the relationship.
    • In the example, \(s_{xy} = 11\) (Equation (8.15)) and \(s_{x} = 1.49\), \(s_{y} = 7.93\) (Equation (8.12))
    • Thus, \(r_{xy} = 0.93\)
    • Caution: Correlation provides a measure of linear association and not necessarily causation. A high correlation between two variables does not mean that changes in one variable will cause changes in the other variable.
    • Caution: Because the correlation coefficient measures only the strength of the linear relationship between two quantitative variables, it is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the relationship between the two variables is nonlinear.

\[\begin{align} \rho_{xy} &= \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}} \\ r_{xy} &= \frac{s_{xy}}{s_{x}s_{y}} \end{align} \tag{8.16}\]

Correlation

# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>% 
  mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY) 
#
# #Sample Covariance
sxy <- sum(ii$devXY) / (length(ii$devXY) - 1)
print(sxy)
## [1] 11
jj <- ii %>% mutate(devXsq = devX * devX, devYsq = devY * devY)
# #Sample Covariance Sxy, Sample Standard Deviations Sx Sy
sxy <- sum(ii$devXY) / (nrow(ii) - 1)
sx <- round(sqrt(sum(jj$devXsq) / (nrow(jj) - 1)), 2)
sy <- round(sqrt(sum(jj$devYsq) / (nrow(jj) - 1)), 2)
cat(paste0("Sxy =", sxy, ", Sx =", sx, ", Sy =", sy, "\n"))
## Sxy =11, Sx =1.49, Sy =7.93
#
# #Correlation Coefficient Rxy
rxy <- round(sxy / (sx * sy), 2)
cat(paste0("Correlation Coefficient Rxy =", rxy, "\n"))
## Correlation Coefficient Rxy =0.93
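As a cross-check (not in the original text), base R's built-in cov() and cor() reproduce the hand-computed values; the vectors below are retyped from Table 8.1.

```r
# #Cross-check the manual calculation with base R
x <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)            # Commercials (Table 8.1)
y <- c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46)  # Sales (Table 8.1)
cov(x, y)   # sample covariance Sxy = 11
cor(x, y)   # Pearson correlation Rxy, rounds to 0.93
```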

Data

Table 8.1: (C03T01) Correlation Calculation
Week Commercials Sales devX devY devXY devXsq devYsq
1 2 50 -1 -1 1 1 1
2 5 57 2 6 12 4 36
3 1 41 -2 -10 20 4 100
4 3 54 0 3 0 0 9
5 4 54 1 3 3 1 9
6 1 38 -2 -13 26 4 169
7 5 63 2 12 24 4 144
8 3 48 0 -3 0 0 9
9 4 59 1 8 8 1 64
10 2 46 -1 -5 5 1 25

Validation


9 Probability

9.1 Overview

  • This chapter covers Probability, Factorial, Combinations, Permutations, Bayes Theorem.

9.2 Probability

Definition 9.1 Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.
Definition 9.2 A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.
Definition 9.3 The sample space for a random experiment is the set of all experimental outcomes.
  • Random experiment of tossing a coin has a Sample Space \(S = \{\text{Head}, \text{Tail}\}\)
  • Random experiment of rolling a die has a Sample Space \(S = \{1,2,3,4,5,6\}\)
  • Random experiment of tossing Two coins has a Sample Space \(S = \{\text{HH},\text{HT},\text{TH},\text{TT}\}\)

9.3 Counting Rule

Definition 9.4 Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \((n_1)(n_2) \cdots (n_k)\)
Definition 9.5 A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.
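The counting rule can be checked directly in R; the three-step coin-toss values below are illustrative.

```r
# #Counting rule: k steps with n1, n2, ..., nk outcomes -> product of the ni
n <- c(2, 2, 2)   # e.g. tossing a coin three times
prod(n)           # 8 experimental outcomes
# #Enumerating the sample space is a simple stand-in for a tree diagram
grid <- expand.grid(toss1 = c("H", "T"), toss2 = c("H", "T"), toss3 = c("H", "T"))
nrow(grid)        # also 8
```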

9.4 Factorial

Definition 9.6 The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to n. The value of 0! is 1 i.e. \(0!=1\)

\[\begin{align} n! &= \prod _{i=1}^n i = n \cdot (n-1)! \\ &= n \cdot(n-1)\cdot(n-2)\cdot(n-3)\cdot\cdots \cdot 3 \cdot 2 \cdot 1 \end{align} \tag{9.1}\]
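R provides factorial() directly; a quick sketch:

```r
factorial(5)   # 5! = 120
factorial(0)   # 0! = 1 by definition
prod(1:5)      # the same product written out
```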

9.5 Combinations

Definition 9.7 Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)

\[C_k^N = \binom{N}{k} = \frac{N!}{k!(N-k)!} \tag{9.2}\]
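Equation (9.2) is available in R as choose(); for example, selecting 3 objects from 6:

```r
choose(6, 3)                                  # binomial coefficient C(6, 3) = 20
factorial(6) / (factorial(3) * factorial(3))  # same value via the formula
```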

9.6 Permutations

Definition 9.8 Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)

\[P_k^N = k! \binom{N}{k} = \frac{N!}{(N-k)!} \tag{9.3}\]

  • The number of permutations of \({k}\) distinct objects is \(k!\)
    • An experiment results in more permutations than combinations for the same number of objects because every selection of \({k}\) objects can be ordered in \(k!\) different ways.
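Base R has no built-in nPk function, so a small helper (the name perm is my own) computes equation (9.3) from the relation \(P_k^N = k!\,C_k^N\):

```r
# #Permutations of N objects taken k at a time (helper name is illustrative)
perm <- function(N, k) factorial(k) * choose(N, k)
perm(6, 3)                        # 120
factorial(6) / factorial(6 - 3)   # same via N!/(N-k)!
```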

9.7 Assigning Probabilities

  • Basic Requirements (Similar to the Discrete Probability & Continuous Probability)
    1. The probability assigned to each experimental outcome must be between 0 and 1, inclusive. If we let \({E_i}\) denote the \(i^{th}\) experimental outcome and \(P(E_i)\) its probability, then \(P(E_i) \in [0,1]\)
    2. The sum of the probabilities for all the experimental outcomes must equal 1. Thus for \({k}\) experimental outcomes \(\sum _{i=1}^k P(E_i) =1\)
Definition 9.9 An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \(S\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)
Definition 9.10 Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)
Definition 9.11 Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)
Definition 9.12 Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)
  • Refer to the Addition Law in the equation (9.4)

\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{9.4}\]

Definition 9.13 Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \emptyset\)

9.8 Exercises

  • How many ways can three items be selected from a group of six items
    • Solution: \(C_{3}^{6} = 6!/(3!\,3!) = 20\)
  • In an experiment of tossing a coin three times, how many experimental outcomes are possible
    • Solution: \(2^{3} = 8\)
  • Simple random sampling uses a sample of size k from a population of size N to obtain data that can be used to make inferences about the characteristics of a population. Suppose that, from a population of 50 bank accounts, we want to take a random sample of four accounts in order to learn about the population. How many different random samples of four accounts are possible
    • Solution: \(C_{4}^{50} = 50!/(4!\,46!) = 230{,}300\)
  • To play Powerball, a participant must select five numbers from the digits 1 through 59, and then select a Powerball number from the digits 1 through 35. To determine the winning numbers for each game, lottery officials draw 5 white balls out a drum of 59 white balls numbered 1 through 59 and 1 red ball out of a drum of 35 red balls numbered 1 through 35. To win the Powerball jackpot, numbers on the lottery must match the numbers on the 5 white balls in any order and must also match the number on the red Powerball. How many Powerball lottery outcomes are possible
    • Solution: \(C_{5}^{59} \times C_{1}^{35} = 5{,}006{,}386 \times 35 = 175{,}223{,}510\)
  • An experiment has four equally likely outcomes: E1, E2, E3, and E4
    • What is the probability that E2 occurs
      • Solution: \({1/4}\)
    • What is the probability that any two of the outcomes occur (e.g., E1 or E3)
      • Solution: \(2/4 = 1/2\)
    • What is the probability that any three of the outcomes occur (e.g., E1 or E2 or E4)
      • Solution: \({3/4}\)
  • Consider the experiment of selecting a playing card from a deck of 52 playing cards. Each card corresponds to a sample point with a 1/52 probability.
    • Probability of the event that an ace is selected
      • Solution: \(4/52 = 1/13\)
    • Probability of the event that a club is selected
      • Solution: \(13/52 = 1/4\)
    • Probability of the event that a face card (jack, queen, or king) is selected
      • Solution: \((3 \times 4)/52 = 12/52 = 3/13\)
  • Consider the experiment of rolling a pair of dice. Suppose that we are interested in the sum of the face values showing on the dice.
    • How many sample points are possible
      • Solution: \(6 \times 6 = 36\)
    • What is the probability of obtaining a value of 7
      • Solution: \(E_{7} = \{(1,6), (6,1), (2,5), (5,2), (3,4), (4,3)\} \Rightarrow P(E_{7}) = 6/36 = 1/6\)
    • What is the probability of obtaining a value of 9 or greater
      • Solution: \(P(E_{\geq9}) = P(E_{9} \cup E_{10} \cup E_{11} \cup E_{12}) = \frac{4+3+2+1}{36} = \frac{10}{36} = \frac{5}{18}\)
    • Because each roll has six possible even values (2, 4, 6, 8, 10, and 12) and only five possible odd values (3, 5, 7, 9, and 11), the dice should show even values more often than odd values. Do you agree with this statement
      • Solution: NO: 18 of the 36 outcomes have an odd sum and 18 have an even sum, so \(P(E_{\text{odd}}) = P(E_{\text{even}}) = 18/36 = 1/2\)
  • A survey of magazine subscribers showed that 45.8% rented a car during the past 12 months for business reasons, 54% rented a car during the past 12 months for personal reasons, and 30% rented a car during the past 12 months for both business and personal reasons.
    • Let B denote Business, P denote Personal
    • What is the probability that a subscriber rented a car during the past 12 months for business or personal reasons
      • Solution: \(P(B \cup P) = P(B)+P(P)-P(B \cap P) = 0.458 + 0.540 - 0.3 = 0.698\)
    • What is the probability that a subscriber did not rent a car during the past 12 months for either business or personal reasons
      • Solution: \(P(B \cup P)^{c} = 1 - 0.698 = 0.302\)
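The counting-based solutions above can be verified numerically with choose():

```r
choose(6, 3)                   # 20 ways to select three items from six
2^3                            # 8 outcomes when tossing a coin three times
choose(50, 4)                  # 230300 possible samples of four accounts
choose(59, 5) * choose(35, 1)  # 175223510 possible Powerball outcomes
```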

9.9 Conditional Probability

Definition 9.14 Conditional probability is the probability of an event given that another event has already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
Table 9.1: (C04T01) Police: Promotion and Gender
Promo_Gender Men Women SUM
Promoted 288 36 324
NotPromoted 672 204 876
Total 960 240 1200
Table 9.1: (C04T01A) Joint and Marginal Probabilities
Promo_Gender Men Women SUM
Promoted 0.24 0.03 0.27
NotPromoted 0.56 0.17 0.73
Total 0.80 0.20 1.00
  • Refer to the Police Promotion Table 9.1
    • Let, M (Man), W (Woman), A (Promoted), \(A^{c}\) (Not Promoted)
    • Probability that a randomly selected officer …
      • is man and is promoted: \(P(A \cap M) = 288/1200 = 0.24\)
      • is woman and is promoted: \(P(A \cap W) = 36/1200 = 0.03\)
      • is man and is not promoted: \(P(A^{c} \cap M) = 672/1200 = 0.56\)
      • is woman and is not promoted: \(P(A^{c} \cap W) = 204/1200 = 0.17\)
      • NOTE: Each of these are Joint Probabilities because these provide intersection of two events.
    • Marginal probabilities are the values in the margins of the joint probability table and indicate the probabilities of each event separately.
      • \(P(M) = 0.80, P(W) = 0.20, P(A) = 0.27, P(A^{c}) = 0.73\)
      • Ex: the marginal probability of being promoted is \(P(A) = P(A \cap M) + P(A \cap W)\)
    • Conditional Probability Analysis
      • “the probability that an officer is promoted given that the officer is a man” \(P(A|M)\)
        • \(P(A|M) = 288/960 = 0.30\)
        • OR \(P(A|M) = P(A \cap M) / P(M) = 0.24/0.80 = 0.30\)
        • “Given that an officer is a man, that officer had a 30% chance of receiving a promotion”
      • “the probability that an officer is promoted given that the officer is a woman” \(P(A|W)\)
        • \(P(A|W) = P(A \cap W) / P(W) = 0.03/0.20 = 0.15\)
        • “Given that an officer is a woman, that officer had a 15% chance of receiving a promotion”
      • Conclusion
        • The probability of a promotion given that the officer is a man is .30, twice the .15 probability of a promotion given that the officer is a woman.
        • Although the use of conditional probability does not in itself prove that discrimination exists in this case, the conditional probability values do support this argument.
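The promotion analysis can be reproduced from the counts in Table 9.1; a minimal sketch in base R:

```r
# #Counts from Table 9.1
promo <- matrix(c(288, 36, 672, 204), nrow = 2, byrow = TRUE,
                dimnames = list(c("Promoted", "NotPromoted"), c("Men", "Women")))
joint <- promo / sum(promo)       # joint probability table
p_M <- sum(joint[, "Men"])        # marginal P(M) = 0.80
p_W <- sum(joint[, "Women"])      # marginal P(W) = 0.20
joint["Promoted", "Men"] / p_M    # P(A|M) = 0.30
joint["Promoted", "Women"] / p_W  # P(A|W) = 0.15
```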
Definition 9.15 Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)
  • Refer to the Multiplication Law in the equation (9.5)
    • Example: 84% of the households in a neighborhood subscribe to the daily edition of a newspaper; that is, \(P(D) =0.84\). In addition, it is known that the probability that a household that already holds a daily subscription also subscribes to the Sunday edition is .75; that is, \(P(S|D) =0.75\)
      • What is the probability that a household subscribes to both the Sunday and daily editions of the newspaper
        • \(P(S \cap D) = P(D) \cdot P(S|D) = 0.84 \times 0.75 = 0.63\)
        • “63% of the households subscribe to both the Sunday and daily editions”

\[\begin{align} P(A \cap B) &= P(B) \cdot P(A | B) \\ &= P(A) \cdot P(B | A) \end{align} \tag{9.5}\]
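The newspaper example reduces to one line of arithmetic via the multiplication law:

```r
p_D <- 0.84          # P(D): household holds a daily subscription
p_S_given_D <- 0.75  # P(S|D): Sunday subscription given daily
p_D * p_S_given_D    # P(S and D) = 0.63
```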

  • Mutually Exclusive vs. Independent Events
    • Two events with nonzero probabilities cannot be both mutually exclusive and independent.
    • If one mutually exclusive event is known to occur, the other cannot occur; thus, the probability of the other event occurring is reduced to zero. They are therefore dependent.

9.10 Bayes Theorem

Often, we begin the analysis with initial or prior probability estimates for specific events of interest. Then, from sources such as a sample, a special report, or a product test, we obtain additional information about the events. Given this new information, we update the prior probability values by calculating revised probabilities, referred to as posterior probabilities. Bayes theorem provides a means for making these probability calculations.

  • Refer to the equation (9.6)
    • Bayes theorem is applicable when the events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space.
      • An event, \(P(A)\), and its complement, \(P(A^{c})\), are mutually exclusive, and their union is the entire sample space. Thus, Bayes theorem is always applicable for computing posterior probabilities of an event and its complement.
    • Example: A firm has two suppliers; currently 65% of its parts come from supplier 1 and the remainder from supplier 2; that is, \(P(A_{1}) = 0.65, P(A_{2}) = 0.35\). The quality of supplied parts is 98% Good for supplier 1 and 95% Good for supplier 2.
      • \(P(G|A_{1}) = 0.98, P(B|A_{1}) = 0.02\)
      • \(P(G|A_{2}) = 0.95, P(B|A_{2}) = 0.05\)
      • Given that we received a Bad Part, what is the probability that it came from supplier 2
        • \(P(A_{2}|B) = \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} = \frac{0.35 \times 0.05}{0.65 \times 0.02 + 0.35 \times 0.05} = 0.5738 \approx 57\%\)
        • Similarly, \(P(A_{1}|B) = 0.4262 \approx 43\%\)
      • NOTE: While the Probability of a random part being from supplier 1 is \(P(A_{1}) = 0.65\), it is reduced to \(P(A_{1}|B) = 0.4262 \approx 43\%\) as we have received new information that the part is Bad.

\[\begin{align} P(A_{1}|B) &= \frac{P(A_{1})P(B|A_{1})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \\ P(A_{2}|B) &= \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \end{align} \tag{9.6}\]
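The two-supplier example maps directly onto equation (9.6); a minimal sketch:

```r
# #Prior probabilities and P(Bad | supplier) from the example
prior <- c(A1 = 0.65, A2 = 0.35)
p_bad <- c(A1 = 0.02, A2 = 0.05)
posterior <- prior * p_bad / sum(prior * p_bad)  # Bayes theorem
round(posterior, 4)                              # A1 = 0.4262, A2 = 0.5738
```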

Validation


10 Discrete Probability Distributions

10.1 Overview

10.2 Definitions (Ref)

6.13 Quantitative data that measure ‘how many’ are discrete.

6.14 Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.

10.3 Random Variable

Definition 10.1 A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’
Definition 10.2 A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. This includes factor-type variables, e.g. coding Male as 0 and Female as 1.
Definition 10.3 A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

10.4 Discrete Probability Distributions

Definition 10.4 The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
Definition 10.5 For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.
  • The use of the relative frequency method to develop discrete probability distributions leads to what is called an empirical discrete distribution.
    • We treat the data as if they were the population and use the relative frequency method to assign probabilities to the experimental outcomes.
    • The distribution of data is how often each observation occurs, and can be described by its central tendency and variation around that central tendency.
  • Basic Requirements (Similar to the Probability Basics & Continuous Probability)
    1. \(f(x) \geq 0\)
    2. \(\sum {f(x)} = 1\)
  • The simplest example of a discrete probability distribution given by a formula is the discrete uniform probability distribution; \(f(x) = 1/n\), where \({n}\) is the number of values the random variable may assume
    • Each possible value of the random variable has the same probability

10.4.1 Expected Value

Definition 10.6 The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)
  • NOTE
    • The expected value is a weighted average of the values of the random variable where the weights are the probabilities.
    • The expected value does not have to be a value the random variable can assume, i.e. the average need not be an integer

10.4.2 Variance

Definition 10.7 The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)
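Definitions 10.6 and 10.7 applied to the DiCarlo distribution of Table 10.1:

```r
# #DiCarlo daily sales distribution (Table 10.1)
x  <- 0:5
fx <- c(0.18, 0.39, 0.24, 0.14, 0.04, 0.01)
Ex <- sum(x * fx)            # expected value E(x) = 1.5
Vx <- sum((x - Ex)^2 * fx)   # variance Var(x) = 1.25
```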

10.5 Bivariate Distributions

Definition 10.8 A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.
  • NOTE:
    • Each outcome for a bivariate experiment consists of two values, one for each random variable. Example: Rolling a pair of dice
    • Bivariate probabilities are often called joint probabilities

10.6 Ex Dicarlo

Table

Table 10.1: (C05T04) Variance Calculation
\({x}\) \(f(x)\) \(xf(x)\) \((x - \mu)\) \((x - \mu)^2\) \((x - \mu)^{2}f(x)\)
0 0.18 0 -1.5 2.25 0.405
1 0.39 0.39 -0.5 0.25 0.0975
2 0.24 0.48 0.5 0.25 0.06
3 0.14 0.42 1.5 2.25 0.315
4 0.04 0.16 2.5 6.25 0.25
5 0.01 0.05 3.5 12.25 0.1225
Total 1.00 mu = 1.5 NA NA sigma^2 = 1.25

Data

# #Dicarlo: Days with Number of Cars Sold per day for last 300 days
xxdicarlo <- tibble(Cars = 0:5, Days = c(54, 117, 72, 42, 12, 3))
#
bb <- xxdicarlo
bb <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
R_dicarlo_var_y_C05 <- sum(bb$x_mu_sq_Fx)
# #Total Row
bb <- bb %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>% 
  mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
         x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) %>% 
  mutate(across(4:5, ~ replace(., x == "Total", NA)))

Change Class

# #Change Column Classes as required
bb %>% mutate(across(1, as.character))
bb %>% mutate(across(everything(), as.character))

Modify Value

bb <- xxdicarlo
ii <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
# #Add Total Row
ii <- ii %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) 
#
# #Modify Specific Row Values without using filter() 
# #filter() does not have 'un-filter()' function like group()-ungroup() combination
# #Selecting Row where x = "Total" and changing Column Values for Two Columns
ii <- ii %>% 
  mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
       x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) 
#
# #Selecting Row where x = "Total" and doing same replacement on Two Columns
ii %>% mutate(across(4:5, function(y) replace(y, x == "Total", NA)))
ii %>% mutate(across(4:5, ~ replace(., x == "Total", NA)))

10.7 Ex Dicarlo GS

Table

Table 10.2: (C05T01) Bivariate Table
Geneva_Saratoga y0 y1 y2 y3 y4 y5 SUM
x0 21 30 24 9 2 0 86
x1 21 36 33 18 2 1 111
x2 9 42 9 12 3 2 77
x3 3 9 6 3 5 0 26
Total 54 117 72 42 12 3 300
Table 10.2: (C05T02) Probability Distribution
Geneva_Saratoga y0 y1 y2 y3 y4 y5 SUM
x0 0.07 0.10 0.08 0.03 0.007 0.000 0.29
x1 0.07 0.12 0.11 0.06 0.007 0.003 0.37
x2 0.03 0.14 0.03 0.04 0.010 0.007 0.26
x3 0.01 0.03 0.02 0.01 0.017 0.000 0.09
Total 0.18 0.39 0.24 0.14 0.040 0.010 1.00

DataGS

xxdicarlo_gs <- tibble(Geneva_Saratoga = c("x0", "x1", "x2", "x3"), 
             y0 = c(21, 21, 9, 3), y1 = c(30, 36, 42, 9), y2 = c(24, 33, 9, 6), 
             y3 = c(9, 18, 12, 3), y4 = c(2, 2, 3, 5), y5 = c(0, 1, 2, 0))
bb <- xxdicarlo_gs
#
# #Tibble Total SUM 
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Add Total Row and SUM Column
ii <- bb %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum))) %>% 
  mutate(SUM = rowSums(across(where(is.numeric))))
#
# #Convert to Bivariate Probability Distribution and then add Total Row and SUM Column
jj <- bb %>% 
  mutate(across(where(is.numeric), ~./sum_bb)) %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum))) %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), format, digits =1))

Tibble Total SUM

bb <- xxdicarlo_gs
# #Assuming there is NO Total Column NOR Total Row and First Column is character
kk <- bb %>% summarise(across(where(is.numeric), sum)) %>% summarise(sum(.)) %>% pull(.)
ll <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
stopifnot(identical(kk, ll))
print(kk)
## [1] 300

format()

bb <- xxdicarlo_gs
# #Round off values to 1 significant digit i.e. 0.003 or 0.02
# #NOTE: This changes the column to "character"
bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% 
  mutate(across(where(is.numeric), format, digits =1))
## # A tibble: 4 x 7
##   Geneva_Saratoga y0    y1    y2    y3    y4    y5   
##   <chr>           <chr> <chr> <chr> <chr> <chr> <chr>
## 1 x0              0.07  0.10  0.08  0.03  0.007 0.000
## 2 x1              0.07  0.12  0.11  0.06  0.007 0.003
## 3 x2              0.03  0.14  0.03  0.04  0.010 0.007
## 4 x3              0.01  0.03  0.02  0.01  0.017 0.000

10.8 Bivariate …

  • Suppose we would like to know the probability distribution for total sales at both DiCarlo dealerships and the expected value and variance of total sales.
    • We can define \(s = x + y\) as Total Sales.
    • Refer to the Tables 10.2 and 10.3
      • \(f(s_0) = f(x_0, y_0) = 0.07\)
      • \(f(s_1) = f(x_0, y_1) + f(x_1, y_0) = 0.10 + 0.07 = 0.17\)

Table

Table 10.3: (C05T03) Bivariate Expected Value and Variance
\(ID\) \({s}\) \(f(s)\) \(sf(s)\) \((s - E(s))\) \((s - E(s))^2\) \((s - E(s))^{2}f(s)\)
A 0 0.070 0.00 -2.64 6.99 0.489
B 1 0.170 0.17 -1.64 2.70 0.459
C 2 0.230 0.46 -0.64 0.41 0.095
D 3 0.290 0.87 0.36 0.13 0.037
E 4 0.127 0.51 1.36 1.84 0.233
F 5 0.067 0.33 2.36 5.55 0.370
G 6 0.023 0.14 3.36 11.27 0.263
H 7 0.023 0.16 4.36 18.98 0.443
I 8 0.000 0.00 5.36 28.69 0.000
Total NA 1.000 E(s) = 2.64 NA NA Var(s) = 2.389

Code

bb <- xxdicarlo_gs
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
# #Convert to Bivariate Probability Distribution
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>% 
  relocate(Fs, .after = last_col()) %>% 
  mutate(sFs = s * Fs, s_Es = s - sum(sFs), 
             s_Es_sq = s_Es * s_Es, s_Es_sq_Fs = s_Es_sq * Fs) 
# #Save for Notebook
R_dicarlo_var_s_C05 <- sum(kk$s_Es_sq_Fs)
# #For Printing
ll <- kk %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>% 
  mutate(across(where(is.numeric), format, digits =2)) %>% 
  mutate(sFs = ifelse(ID == "Total", paste0("E(s) = ", sFs), sFs),
         s_Es_sq_Fs = ifelse(ID == "Total", paste0("Var(s) = ", s_Es_sq_Fs), s_Es_sq_Fs)) %>% 
  mutate(across(c(2,5,6), ~ replace(., ID == "Total", NA)))

Bivariate to Original

bb <- xxdicarlo_gs
# #From the Bivariate get the original data
ii <- bb %>% 
  mutate(Fx = rowSums(across(where(is.numeric)))) %>% 
  select(1, 8) %>% 
  separate(col = Geneva_Saratoga, into = c(NA, "x"), sep = 1) %>% 
  mutate(across(1, as.integer))
# #Variance Calculation
jj <- ii %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
# #Save for Notebook
R_dicarlo_var_x_C05 <- sum(jj$x_mu_sq_Fx)
print(jj)
## # A tibble: 4 x 6
##       x     Fx   xFx   x_mu x_mu_sq x_mu_sq_Fx
##   <int>  <dbl> <dbl>  <dbl>   <dbl>      <dbl>
## 1     0 0.287  0     -1.14   1.31      0.375  
## 2     1 0.37   0.37  -0.143  0.0205    0.00760
## 3     2 0.257  0.513  0.857  0.734     0.188  
## 4     3 0.0867 0.26   1.86   3.45      0.299

Sum Diagonals

bb <- xxdicarlo_gs
#
# #Tibble Total SUM 
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Convert to Bivariate Probability Distribution and Exclude First Character Column
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
#
# #(1A, 2B, 3C, 4D, 4E, 4F, 3G, 2H, 1I) 9 Unique Combinations = 24 (4x6) Experimental Outcomes 
matrix(data = LETTERS[row(ii) + col(ii)-1], nrow = 4)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "A"  "B"  "C"  "D"  "E"  "F" 
## [2,] "B"  "C"  "D"  "E"  "F"  "G" 
## [3,] "C"  "D"  "E"  "F"  "G"  "H" 
## [4,] "D"  "E"  "F"  "G"  "H"  "I"
# 
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
print(jj)
##          A          B          C          D          E          F          G          H          I 
## 0.07000000 0.17000000 0.23000000 0.29000000 0.12666667 0.06666667 0.02333333 0.02333333 0.00000000
# #In place of LETTERS, Numerical Index can also be used but Letters are more clear for grouping
#tapply(X= as.matrix(ii), INDEX = c(0:8)[row(ii) + col(ii)-1], FUN = sum)
#
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>% 
  relocate(Fs, .after = last_col())
print(kk)
## # A tibble: 9 x 3
##   ID        s     Fs
##   <chr> <dbl>  <dbl>
## 1 A         0 0.07  
## 2 B         1 0.17  
## 3 C         2 0.23  
## 4 D         3 0.29  
## 5 E         4 0.127 
## 6 F         5 0.0667
## 7 G         6 0.0233
## 8 H         7 0.0233
## 9 I         8 0

String Split

bb <- xxdicarlo_gs
# #Separate String based on Position 
bb %>% separate(col = Geneva_Saratoga, into = c("A", "B"), sep = 1) 
## # A tibble: 4 x 8
##   A     B        y0    y1    y2    y3    y4    y5
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x     0        21    30    24     9     2     0
## 2 x     1        21    36    33    18     2     1
## 3 x     2         9    42     9    12     3     2
## 4 x     3         3     9     6     3     5     0

10.9 Covariance

  • Covariance of random variables x and y is given by \(\sigma_{xy}\), Refer equation (10.1)
    • NOTE: It does not look like equation (8.15), but for now, I am assuming the two are equivalent
    • Calculated: \(\text{Var}(s) = \text{Var}(x+y) =\) 2.389; \(\text{Var}(y) =\) 1.25; \(\text{Var}(x) =\) 0.869
    • Covariance \(\sigma_{xy} = \frac{2.3895 - 0.8696 - 1.25}{2} = 0.1350\)
    • A covariance of .1350 indicates that daily sales at the two dealerships have a positive relationship.

\[\sigma_{xy} = \frac{\text{Var}(x+y) - \text{Var}(x) - \text{Var}(y)}{2} \tag{10.1}\]

  • Correlation of random variables x and y is given by, Refer equation (8.16), \(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}\)
    • Where \(\sigma_{x} = \sqrt{\text{Var}(x)} = \sqrt{0.8696} = 0.9325\); and \(\sigma_{y} = \sqrt{\text{Var}(y)} = \sqrt{1.25} = 1.1180\)
    • Correlation Coefficient \(\rho_{xy} = \frac{0.1350}{0.9325 \times 1.1180} = 0.1295\)
    • The correlation coefficient of .1295 indicates there is a weak positive relationship between the random variables representing daily sales at the two dealerships.
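The covariance and correlation above follow from equation (10.1); a minimal numeric check (variances retyped from the text):

```r
var_s <- 2.3895; var_x <- 0.8696; var_y <- 1.25   # Var(x+y), Var(x), Var(y)
sigma_xy <- (var_s - var_x - var_y) / 2           # covariance, ~0.1350
rho_xy <- sigma_xy / (sqrt(var_x) * sqrt(var_y))  # correlation, ~0.129
```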

10.10 Distributions

  • “ForLater”
    • Binomial Probability Distribution - dbinom(), pbinom(), qbinom(), rbinom()
      • It can be used to determine the probability of obtaining \({x}\) successes in \({n}\) trials.
      • 4 Assumptions must be TRUE
        1. The experiment consists of a sequence of \({n}\) identical trials.
        2. Two outcomes are possible on each trial, one called success and the other failure.
        3. The probability of a success \({p}\) does not change from trial to trial. Consequently, the probability of failure, \(1 − p\), does not change from trial to trial.
        4. The trials are independent.
    • Poisson Probability Distribution - dpois(), ppois(), qpois(), rpois()
      • To determine the probability of obtaining \({x}\) occurrences over an interval of time or space.
      • 2 Assumptions must be TRUE
        1. The probability of an occurrence of the event is the same for any two intervals of equal length.
        2. The occurrence or nonoccurrence of the event in any interval is independent of the occurrence or nonoccurrence of the event in any other interval.
    • Hypergeometric Probability Distribution
      • Like the binomial, it is used to compute the probability of \({x}\) successes in \({n}\) trials.
      • But, in contrast to the binomial, the probability of success changes from trial to trial.
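The d/p/q/r families mentioned above are all built in; the parameter values here are purely illustrative:

```r
# #Binomial: probability of x = 3 successes in n = 10 trials with p = 0.3
dbinom(3, size = 10, prob = 0.3)
pbinom(3, size = 10, prob = 0.3)   # P(x <= 3)
# #Poisson: probability of x = 2 occurrences when the mean rate is 5
dpois(2, lambda = 5)
# #Hypergeometric: 2 aces in a 5-card hand (4 aces, 48 non-aces in the deck)
dhyper(2, m = 4, n = 48, k = 5)
```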

Validation


11 Continuous Probability Distributions

11.1 Overview

11.2 Definitions (Ref)

10.2 A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. This includes factor-type variables, e.g. coding Male as 0 and Female as 1.

10.3 A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

11.3 Uniform Probability Distribution

Definition 11.1 Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.
Definition 11.2 The probability that the continuous random variable \({x}\) takes a value in the interval \([a, b]\) is given by the area under the graph of the probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1; however, its integral must equal 1.
  • Basic Requirements (Similar to the Probability Basics & Discrete Probability )
    1. \(f(x) \geq 0\)
    2. \(A = \int _{-\infty}^{\infty}f(x)\ dx =1\)
  • NOTE:
    • For a discrete random variable, the probability function \(f(x)\) provides the probability that the random variable assumes a particular value. With continuous random variables, the counterpart of the probability function is the probability density function \(f(x)\).
      • The difference is that the probability density function does not directly provide probabilities. However, the area under the graph of \(f(x)\) corresponding to a given interval does provide the probability that the continuous random variable \({x}\) assumes a value in that interval.
      • So when we compute probabilities for continuous random variables we are computing the probability that the random variable assumes any value in an interval (NOT at any particular point).
      • Because the area under the graph of \(f(x)\) at any particular point is zero, the probability of any particular value of the random variable is zero.
      • It also means that the probability of a continuous random variable assuming a value in any interval is the same whether or not the endpoints are included.
    • Expected Value and Variance are given by (11.1)

\[\begin{align} E(x) &= \frac{a+b}{2} \\ \text{Var}(x) &= \frac{(b-a)^2}{12} \end{align} \tag{11.1}\]
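As a sketch, the uniform distribution's constant density, interval probabilities, and equation (11.1) can be checked with R's dunif()/punif(); the interval \([a, b] = [0, 10]\) is an illustrative choice.

```r
# #Uniform Probability Distribution on [a, b]; a = 0, b = 10 (illustrative)
a_uu <- 0
b_uu <- 10
# #Constant density f(x) = 1/(b - a) inside the interval
dunif(x = 5, min = a_uu, max = b_uu)
## [1] 0.1
# #Probability is proportional to interval length: P(2 <= x <= 6) = 4/10
punif(q = 6, min = a_uu, max = b_uu) - punif(q = 2, min = a_uu, max = b_uu)
## [1] 0.4
# #Expected Value and Variance (Equation 11.1)
(a_uu + b_uu) / 2      #E(x) = 5
(b_uu - a_uu)^2 / 12   #Var(x) = 8.3333
```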

11.4 Normal Probability Distribution

Definition 11.3 A normal distribution (\({\mathcal {N}}_{(\mu,\, \sigma^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
  • The general form of its probability density function is given by equation (11.2)
    • Normal distribution \({\mathcal {N}}_{(\mu,\, \sigma)}\) is also known as Gaussian or Gauss or Laplace–Gauss distribution
    • It is symmetrical
    • The entire family of normal distributions is differentiated by two parameters: the mean \({\mu}\) and the standard deviation \({\sigma}\). They determine the location and shape of the normal distribution.
    • The highest point on the normal curve is at the mean, which is also the median and mode of the distribution.
    • The normal distribution is symmetric around its mean. Its skewness measure is zero.
    • The tails of the normal curve extend to infinity in both directions and theoretically never touch the horizontal axis.
    • Larger values of the standard deviation result in wider, flatter curves, showing more variability in the data.
    • Probabilities for the normal random variable are given by areas under the normal curve. The total area under the curve for the normal distribution is 1.
    • Values of a normal random variable are given as: \(68.27\% (\mu \pm \sigma), 95.45\% (\mu \pm 2\sigma), 99.73\% (\mu \pm 3\sigma)\). This is the basis of the Empirical Rule.

\[f(x)={\frac {1}{\sigma {\sqrt {2 \pi}}}} e^{-{\frac {1}{2}}\left( {\frac {x-\mu }{\sigma }}\right) ^{2}} \tag{11.2}\]
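The Empirical Rule percentages quoted above can be verified directly from the normal CDF; a quick sketch with pnorm() (the rule holds for any \(\mu\) and \(\sigma\), so the standard normal suffices).

```r
# #Empirical Rule: P(mu - k*sigma <= x <= mu + k*sigma) for k = 1, 2, 3
p_k <- pnorm(1:3) - pnorm(-(1:3))
round(100 * p_k, 2)
## [1] 68.27 95.45 99.73
```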


Figure 11.1 Normal Distribution

Histogram

# #Histogram with Density Curve, Mean and Median: Normal Distribution
ee <- f_getRDS(xxNormal)
hh <- tibble(ee)
ee <- NULL
# #Basics
median_hh <- round(median(hh[[1]]), 3)
mean_hh <- round(mean(hh[[1]]), 3)
sd_hh <- round(sd(hh[[1]]), 3)
len_hh <- nrow(hh)
#
# #Base Plot: Creates Only Density Function Line
ii <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + geom_density() }
#
# #Change the line colour and alpha
ii <- ii + geom_density(alpha = 0.2, colour = "#21908CFF") 
#
# #Add Histogram with 50 bins, alpha and fill
ii <- ii + geom_histogram(aes(y = ..density..), bins = 50, alpha = 0.4, fill = '#FDE725FF')
#
# #Full Vertical Line at Mean. Goes across Function Boundary on Y-Axis
#ii <- ii + geom_vline(aes(xintercept = mean_hh), color = '#440154FF')
#
# #Shaded Area Object for Line/Area up to the Function Boundary on Y-Axis
# #Mean
ii_mean <- ggplot_build(ii)$data[[1]] %>% filter(x <= mean_hh)  
# #Median
ii_median <- ggplot_build(ii)$data[[1]] %>% filter(x <= median_hh)
#
# #To show values which are less than Mean in colour
#ii <- ii + geom_area(data = ii_mean, aes(x = x, y = y), fill = 'blue', alpha = 0.5) 
#
# #Line upto the Density Curve at Mean 
ii <- ii + geom_segment(data = ii_mean, 
             aes(x = mean_hh, y = 0, xend = mean_hh, yend = density), color = "#440154FF")
#
# #Label 'Mean' 
ii <- ii + geom_text(aes(label = paste0("Mean= ", mean_hh), x = mean_hh, y = -Inf),
            color = '#440154FF', hjust = -0.5, vjust = -1, angle = 90, check_overlap = TRUE)
#
# #Similarly, Median Line and Label
ii <- ii + geom_segment(data = ii_median, 
             aes(x = median_hh, y = 0, xend = median_hh, yend = density), color = "#3B528BFF") +
  geom_text(aes(label = paste0("Median= ", median_hh), x = median_hh, y = -Inf), 
            color = '#3B528BFF', hjust = -0.4, vjust = 1.2, angle = 90, check_overlap = TRUE) 
#
# #Change Axis Limits
ii <- ii + coord_cartesian(xlim = c(-5, 5), ylim = c(0, 0.5))
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
ii <- ii + scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh)
#
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]],.05)
q95_hh <- quantile(hh[[1]],.95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 3 * sd_hh)
sig3r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 3 * sd_hh)
sig2r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 2 * sd_hh, x < mean_hh + 3 * sd_hh)
sig2l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 2 * sd_hh, x > mean_hh - 3 * sd_hh)
sig1r_hh <- density_hh_tbl %>% filter(x >= mean_hh + sd_hh, x < mean_hh + 2 * sd_hh)
sig1l_hh <- density_hh_tbl %>% filter(x <= mean_hh - sd_hh, x > mean_hh - 2 * sd_hh)
#
# #Use (mean +/- 3 sigma) To Highlight. NOT ALL Zones have been highlighted
ii <- ii + geom_area(data = sig3l_hh, aes(x = x, y = y), fill = 'red') +
           geom_area(data = sig3r_hh, aes(x = x, y = y), fill = 'red')
#
# #Annotate Arrows 
ii <- ii + 
#  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[3], y = 0.42, 
#                    yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[2], y = 0.45, 
#                    yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[1], y = 0.48, 
                    yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[5], y = 0.42, 
#                    yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[6], y = 0.45, 
#                    yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[7], y = 0.48, 
                    yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc")))
#
# #Annotate Labels
ii <- ii + 
#  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.42, label = "68.3%") +
#  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.45, label = "95.4%") +
  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.48, label = "99.7%")
#
# #Add a Theme and adjust Position of Title & Subtile (Both by plot.title.position) & Caption
# #"plot" or "panel"
ii <- ii + theme(#plot.tag.position = "topleft",
                 #plot.caption.position = "plot", 
                 #plot.caption = element_text(hjust = 0),
                 plot.title.position = "panel")
#
# #Title, Subtitle, Caption, Axis Labels, Tag
ii <- ii + labs(x = "x", y = "Density", 
        subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                          "; Median= ", median_hh, "; SD= ", sd_hh), 
        caption = "C06AA", tag = NULL,
        title = "Normal Distribution (Symmetrical)")
#
#ii

Plot LaTeX

# #Syntax 
#latex2exp::TeX(r'($\sigma =10$)', output = "character")
# #Test Equation
plot(TeX(r'(abc: $\frac{2hc^2}{\lambda^5} \, \frac{1}{e^{\frac{hc}{\lambda k_B T}} - 1}$)'), cex=2)
plot(TeX(r'(xyz: $f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \,\left(\frac{x - \mu}{\sigma}\right)^2} $)'), cex=2)

Annotate Plot

# #Syntax
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($\sigma =10$)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #NOTE: Complex equations like the normal distribution density can crash R.
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \, \left(\frac{x - \mu}{\sigma}\right)^2} $)', output = "character"), parse = TRUE, check_overlap = TRUE)

ggplot_build()

# #Data
bb <- f_getRDS(xxNormal)
hh <- tibble(bb)
# #Base Plot
ii <- hh %>% { ggplot(data = ., mapping = aes(x = bb)) + geom_density() }
# #Attributes 
attributes(ggplot_build(ii))$names
## [1] "data"   "layout" "plot"
#
str(ggplot_build(ii)$data[[1]])
## 'data.frame':    512 obs. of  18 variables:
##  $ y          : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ x          : num  -3.63 -3.61 -3.6 -3.58 -3.57 ...
##  $ density    : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ scaled     : num  0.00126 0.0013 0.00133 0.00136 0.00137 ...
##  $ ndensity   : num  0.00126 0.0013 0.00133 0.00136 0.00137 ...
##  $ count      : num  5.04 5.2 5.32 5.41 5.45 ...
##  $ n          : int  10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
##  $ flipped_aes: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ PANEL      : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ group      : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ ymin       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ymax       : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ fill       : logi  NA NA NA NA NA NA ...
##  $ weight     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ colour     : chr  "black" "black" "black" "black" ...
##  $ alpha      : logi  NA NA NA NA NA NA ...
##  $ size       : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ linetype   : num  1 1 1 1 1 1 1 1 1 1 ...

Errors

ERROR 11.1 Error in is.finite(x) : default method not implemented for type ’list’
  • For ggplot() subsetting inside aes() is discouraged.
  • Assuming names(hh)[1] is "ee"
    • either use (x = "ee") : Use of hh[1] or .[1] will throw an error
    • or use (x = .data[["ee"]]) : Use of hh[[1]] or .[[1]] will work but throws a warning.
      • Warning: "Use of .[[1]] is discouraged. Use .data[[1]] instead."
      • Using .data[[1]] will throw a different error
ERROR 11.2 Error: Must subset the data pronoun with a string.
  • ggplot() | aes() | using .data[[1]] will throw this error
  • use .data[["ee"]] or "ee"
    • .data is pronoun for an environment, it is for scope resolution, not dataframe like dot (.)

UNICODE

STOP! STOP! Just STOP! using UNICODE for R Console on WINDOWS (UTF-8 Issue).

11.5 Standard Normal

Definition 11.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal {z}}_{({\mu} = 0,\, {\sigma} = 1)}\)

\[f(z) = \varphi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} \tag{11.3}\]

  • Refer equation (11.3)
    • Here, the factor \(1/{\sqrt{2\pi}}\) ensures that the total area under the curve \(\varphi(z)\) is equal to one.
    • The factor \(1/2\) in the exponent ensures that the distribution has unit variance, and therefore also unit standard deviation.
    • This function is symmetric around \(z = 0\), where it attains its maximum value \(1/{\sqrt{2\pi}}\), and has inflection points at \(z = +1\) and \(z = -1\).
    • While individual observations from normal distributions are referred to as \({x}\), they are referred to as \({z}\) in the z-distribution.
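As a quick check, equation (11.3) coded by hand agrees with R's built-in dnorm().

```r
# #Standard Normal density (Equation 11.3) vs dnorm()
z_zz <- seq(-3, 3, by = 0.5)
f_zz <- (1 / sqrt(2 * pi)) * exp(-z_zz^2 / 2)
stopifnot(isTRUE(all.equal(f_zz, dnorm(z_zz))))
# #Maximum value 1/sqrt(2*pi) is attained at z = 0
round(dnorm(0), 4)
## [1] 0.3989
```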

8.17 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.

  • NOTE (R,C) notation denotes Row x Column of a Table
    • Because the standard normal random variable is continuous, \(P(z \leq 1.00) = P(z < 1.00)\)
    • The cumulative probability corresponding to \(z = 1.00\) is the table value located at the intersection of the row labeled \({1.0}\) and the column labeled \({.00}\) i.e. \(P_{\left(z\leq 1.00\right)} = P_{\left(1.0, \,.00\right)} = 0.8413\)
    • To compute the probability that \({z}\) is in the interval between −.50 and 1.25
      • \(P_{\left(-0.50 \leq z\leq 1.25\right)} = P_{\left(z\leq 1.25\right)} - P_{\left(z\leq -0.50 \right)} = P_{\left(1.2, \,.05\right)} - P_{\left(-0.50, \,.00\right)} = 0.8944 - 0.3085 = 0.5859\)
    • To compute the probability of obtaining a z value of at least 1.58
      • \(P_{\left(z\geq 1.58\right)} = 1 - P_{\left(z\leq 1.58\right)} = 1 - P_{\left(1.5, \,.08\right)} = 1 - 0.9429 = 0.0571\)
    • To compute the probability that the standard normal random variable is within one standard deviation of the mean
      • \(P_{\left(-1.00 \leq z\leq 1.00\right)} = P_{\left(z\leq 1.00\right)} - P_{\left(z\leq -1.00 \right)} = P_{\left(1.0, \,.00\right)} - P_{\left(-1.0, \,.00\right)} = 0.8413 - 0.1587 = 0.6826\)
  • Reverse i.e. given the probability, find out the z-value
    • Find a z value such that the probability of obtaining a larger z value is .10
      • The standard normal probability table gives the area under the curve to the left of a particular z value, which would be \(P_{\left(z\right)} = 1 - 0.10 = 0.9000 \approx P_{\left(1.2, \,.08\right)} \Rightarrow z = 1.28\)

\({z} \in \mathbb{R} \leftrightarrow P_{(z)} \in (0,1)\)

Cal P

# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
pnorm(q = 1.00)
## [1] 0.8413447

pnorm()

# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
# #pnorm(q = 1.00) #default 'lower.tail = TRUE'
z_ii <- 1.00 
p_ii <- round(pnorm(q = z_ii, lower.tail = TRUE), 4)
cat(paste0("P(z <= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z <= 1.000) = 0.8413
#
# #Probability that z is in the interval between −.50 and 1.25 #0.5859
z_min_ii <- -0.50
z_max_ii <- 1.25
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-0.500 <= z <= 1.250) = 0.5858
#
# #Probability of obtaining a z value of at least 1.58 #0.0571
z_ii <- 1.58
p_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
cat(paste0("P(z >= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z >= 1.580) = 0.0571
#
# #Probability that the z is within one standard deviation of the mean i.e. [-1, 1] #0.6826
z_min_ii <- -1.00
z_max_ii <- 1.00
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-1.000 <= z <= 1.000) = 0.6827

Cal Z

# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10
qnorm(p = 1 - 0.10)
## [1] 1.281552
qnorm(p = 0.10, lower.tail = FALSE)
## [1] 1.281552

qnorm()

# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10 i.e. right >10%
#qnorm(p = 1 - 0.10)
#qnorm(p = 0.10, lower.tail = FALSE)
p_r_ii <- 0.10 
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
z_jj <- round(qnorm(p = p_r_ii, lower.tail = FALSE), 4)
stopifnot(identical(z_ii, z_jj))
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.900 (i.e. (Right) 1-P(z) = 0.100) at z = 1.2816

11.6 Any Normal

  • Any normal distribution can be standardized by converting the individual values into z-scores.
    • z-scores tell how many standard deviations away from the mean each value lies.
  • Probabilities for all normal distributions are computed by using the standard normal distribution.
    • A normal distribution \({\mathcal {N}}_{(\mu,\, \sigma)}\) is converted to the standard normal distribution \({\mathcal {z}}_{(\mu=0,\, \sigma=1)}\) by equation (11.4) (Similar to equation (8.14))
    • If \({x}\) is a random variable from this population, then its z-score is \(Z=\frac {X-\mu }{\sigma}\)
    • If \(\overline {X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \(\sigma/{\sqrt{n}}\) and thus the z-score is \(Z={\frac {\overline {X}-\mu }{\sigma /{\sqrt {n}}}}\)
    • If \(\sum {X}\) is the total of a sample of size \({n}\) from this population, then the expected total is \(n\times\mu\) and the standard error is \(\sigma {\sqrt{n}}\). Thus the z-score is \(Z={\frac {\sum {X}-n\mu }{\sigma {\sqrt {n}}}}\)
  • Thus
    • \(x=\mu \Rightarrow z =0\) i.e. A value of \({x}\) equal to its mean \({\mu}\) corresponds to \(z = 0\).
    • \(x=\mu+\sigma \Rightarrow z = 1\) i.e. an \({x}\) value that is one standard deviation above its mean \((\mu+\sigma)\) corresponds to \(z = 1\).
      • Thus, we can interpret \({z}\) as the number of standard deviations \((\sigma)\) that the normal random variable \({x}\) is from its mean \((\mu)\).
    • For a normal distribution \({\mathcal {N}}_{(\mu=10,\, \sigma=2)}\), What is the probability that the random variable x is between 10 and 14
      • At x=10, z = 0 and at x=14, z = 2, Thus
      • \(P_{\left(0 \leq z\leq 2\right)} = P_{\left(z\leq 2\right)} - P_{\left(z\leq 0 \right)} = P_{\left(2.0, \,.00\right)} - P_{\left(0, \,.00\right)} = 0.9772 - 0.5000 = 0.4772\)
    • Grear Tire Company Problem
      • For a new tire product, the mileage follows a normal distribution \({\mathcal {N}}_{(\mu=36500,\, \sigma=5000)}\).
      • What percentage of the tires can be expected to last more than 40,000 miles, i.e., what is the probability that the tire mileage, x, will exceed 40,000
        • Solution: 24.2%
      • Let us now assume that Grear is considering a guarantee that will provide a discount on replacement tires if the original tires do not provide the guaranteed mileage. What should the guarantee mileage be if Grear wants no more than 10% of the tires to be eligible for the discount guarantee
        • Solution: \(30092 \approx 30100 \text{ miles}\)

\[z = \frac{x- \mu}{\sigma} \tag{11.4}\]

Reasons to convert normal distributions into the standard normal distribution:

  • To find the probability of observations in a distribution falling above or below a given value
  • To find the probability that a sample mean significantly differs from a known population mean
  • To compare scores on different distributions with different means and standard deviations

Each z-score is associated with a probability, or p-value, that gives the likelihood of values below that z-score occurring. By converting an individual value into a z-score, we can find the probability of all values up to that value occurring in a normal distribution.

The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.

The area under the curve to the right of a z-score is the p-value, and it’s the likelihood of your observation occurring if the null hypothesis is true.

Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.

Cal P

# #For N(mu =10, sd =2) Probability that X is in [10, 14]
# #Same as P(0 <= z <= 2)
mu_ii <- 10
sd_ii <- 2
x_min_ii <- 10
x_max_ii <- 14
#
z_min_ii <- (x_min_ii - mu_ii) /sd_ii #0
z_max_ii <- (x_max_ii - mu_ii) /sd_ii #2
#
pz_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
# #OR
px_ii <- round(pnorm(q = x_max_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE) - 
                  pnorm(q = x_min_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(identical(pz_ii, px_ii))
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", pz_ii, "\n"))
## P(0.000 <= z <= 2.000) = 0.4772
cat(paste0("P(", x_min_ii, " <= x <= ", x_max_ii, ") = ", format(px_ii, nsmall = 3), "\n"))
## P(10 <= x <= 14) = 0.4772

Grear Tire

# #Grear Tire N(mu = 36500, sd =5000)
# #Probability that the tire mileage, x, will exceed 40,000 # 24.2% Tires
mu_ii <- 36500
sd_ii <- 5000
x_ii <- 40000
#
z_ii <- (x_ii - mu_ii)/sd_ii
#
#pnorm(q = 40000, mean = 36500, sd = 5000, lower.tail = FALSE)
pz_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
px_ii <- round(pnorm(q = x_ii, mean = mu_ii, sd = sd_ii, lower.tail = FALSE), 4)
stopifnot(identical(px_ii, pz_ii))
#
cat(paste0("P(x >= ", x_ii, ") = ", format(px_ii, nsmall = 4), " (", 
           round(100* px_ii, 2), "%)\n"))
## P(x >= 40000) = 0.2420 (24.2%)
#
# #What should the guarantee mileage be if no more than 10% of the tires to be eligible 
# #for the discount guarantee i.e. left <10% # ~30100 miles
p_l_ii <- 0.10
p_r_ii <- 1 - p_l_ii
#
#qnorm(p = 0.10, mean = 36500, sd = 5000)
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
xz_ii <- z_ii * sd_ii + mu_ii
#
x_ii <- round(qnorm(p = p_l_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(abs(xz_ii - x_ii) < 1)
cat(paste0("(Left) P(x) = ", p_l_ii, " (i.e. (Right) 1-P(z) = ", p_r_ii, 
           ") at x = ", round(x_ii, 1), "\n"))
## (Left) P(x) = 0.1 (i.e. (Right) 1-P(z) = 0.9) at x = 30092.2

Exercises

  • “ForLater”
    • Exercises
    • Normal Approximation of Binomial Probabilities
    • Exponential Probability Distribution
    • Relationship Between the Poisson and Exponential Distributions

Validation


12 Sampling Distributions

12.1 Overview

12.2 Definitions (Ref)

6.2 Elements are the entities on which data are collected. (Generally ROWS)

6.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)

6.20 A population is the set of all elements of interest in a particular study.

6.21 A sample is a subset of the population.

6.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

12.3 Sample

The sample contains only a portion of the population. Some sampling error is to be expected. So, the sample results provide only estimates of the values of the corresponding population characteristics.

Definition 12.1 The sampled population is the population from which the sample is drawn.
Definition 12.2 Frame is a list of the elements that the sample will be selected from.
Definition 12.3 The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the sampled population, but it may differ.
Definition 12.4 A simple random sample (SRS) is a set of \({k}\) objects selected from a population of \({N}\) objects such that all possible samples are equally likely to be chosen. The number of such different simple random samples is \(C_k^N\)
Definition 12.5 Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.
Definition 12.6 Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.
  • Infinite Population
    • Sometimes the population is infinitely large or the elements of the population are being generated by an ongoing process for which there is no limit on the number of elements that can be generated.
    • Thus, it is not possible to develop a list of all the elements in the population. This is considered the infinite population case.
    • With an infinite population, we cannot select a ‘simple random sample’ because we cannot construct a frame consisting of all the elements.
    • In the infinite population case, statisticians recommend selecting what is called a ‘random sample.’
Definition 12.7 A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.
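Definitions 12.5 and 12.6 map directly onto the replace argument of sample(). A minimal sketch with an illustrative population of N = 10 labelled elements and sample size k = 4:

```r
# #Sampling without and with replacement via sample(); N = 10 (illustrative)
set.seed(123)
pop_ss <- 1:10
# #Without replacement: no element can be selected twice
srs_ss <- sample(pop_ss, size = 4, replace = FALSE)
stopifnot(!any(duplicated(srs_ss)))
# #With replacement: a previously selected element can appear again
wr_ss <- sample(pop_ss, size = 4, replace = TRUE)
# #Number of possible simple random samples of size k = 4: C(N, k)
choose(10, 4)
## [1] 210
```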

Random sample vs. SRS

  • Random sample: every element of the population has a (nonzero) probability of being drawn.
    • each element does not necessarily have an equal chance of being chosen.
  • SRS: every element of the population has the same (nonzero) probability of being drawn.
    • SRS is thus a special case of a random sample.
    • SRS is a subset of a statistical population in which each member of the subset has an equal probability of being chosen.
  • Elaboration of the Two conditions for Random Sample
    • Example: Consider a production line designed to fill boxes of a breakfast cereal.
      • Each element selected comes from the same population.
        • To ensure this, the boxes must be selected at approximately the same point in time.
        • This way the inspector avoids the possibility of selecting some boxes when the process is operating properly and other boxes when the process is not operating properly.
      • Each element is selected independently.
        • It is satisfied by designing the production process so that each box of cereal is filled independently.
    • Example: Consider the population of customers arriving at a fast-food restaurant.
      • McDonald's implemented a random sampling procedure for this situation.
      • The sampling procedure was based on the fact that some customers presented discount coupons.
      • Whenever a customer presented a discount coupon, the next customer served was asked to complete a customer profile questionnaire. Because arriving customers presented discount coupons randomly and independently of other customers, this sampling procedure ensured that customers were selected independently.

12.4 Point Estimation

Definition 12.8 A population proportion \({P}\), is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population, and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample, and \({n}\) is the size of the sample obtained from the population.
Definition 12.9 To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.
Definition 12.10 A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation)
Definition 12.11 The numerical value obtained for the sample statistic is called the point estimate. The term ‘estimate’ applies to a sample value only; the corresponding population value is a parameter. An estimate is a value, while an estimator is a function.
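Point estimation in R is simply computing the sample statistics; a sketch on a simulated sample, with illustrative parameters \(\mu = 50, \sigma = 10\):

```r
# #Point Estimates from a sample of N(mu = 50, sigma = 10) (illustrative)
set.seed(42)
x_pp <- rnorm(n = 100, mean = 50, sd = 10)
mean(x_pp)  #point estimate of mu
sd(x_pp)    #point estimate of sigma
var(x_pp)   #point estimate of sigma^2
```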

12.5 Sampling Distributions

Definition 12.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

Suppose, from a Population, we take a sample of size \({n}\) and calculate the point estimate mean \(\overline{x}_{1}\). Further, we can select another random sample from the Population and get another point estimate mean \(\overline{x}_{2}\). If we repeat this process 500 times, we will have a frame of \(\{\overline{x}_{1}, \overline{x}_{2}, \ldots, \overline{x}_{500}\}\).

If we consider the process of selecting a simple random sample as an experiment, the sample mean \({\overline{x}}\) is the numerical description of the outcome of the experiment. Thus, the sample mean \({\overline{x}}\) is a random variable. As a result, just like other random variables, \({\overline{x}}\) has a mean or expected value, a standard deviation, and a probability distribution. Because the various possible values of \({\overline{x}}\) are the result of different simple random samples, the probability distribution of \({\overline{x}}\) is called the sampling distribution of \({\overline{x}}\). Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean \({\overline{x}}\) is to the population mean \({\mu}\).

Just as with other probability distributions, the sampling distribution of \({\overline{x}}\) has an expected value or mean, a standard deviation, and a characteristic shape or form.
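The repeated-sampling idea above can be simulated directly; a sketch drawing 500 samples of size n = 30 from an illustrative \({\mathcal {N}}_{(\mu=50,\, \sigma=10)}\) population:

```r
# #Sampling Distribution of the Sample Mean (simulation sketch)
# #500 samples of size n = 30 from N(mu = 50, sigma = 10) (illustrative)
set.seed(123)
xbar_ss <- replicate(500, mean(rnorm(n = 30, mean = 50, sd = 10)))
# #E(xbar) should be close to mu = 50
mean(xbar_ss)
# #sd(xbar) should be close to sigma/sqrt(n) = 10/sqrt(30) = 1.8257
sd(xbar_ss)
```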

12.5.1 Mean

  • Expected Value of \({\overline{x}}\)
    • The mean of the \({\overline{x}}\) random variable is the expected value of \({\overline{x}}\).
    • Let \(E(\overline{x})\) represent the expected value of \({\overline{x}}\) and \({\mu}\) represent the mean of the population from which we are selecting a simple random sample. Then, \(E(\overline{x}) = \mu\)
    • When the expected value of a point estimator equals the population parameter, we say the point estimator is unbiased. Thus, \({\overline{x}}\) is an unbiased estimator of the population mean \({\mu}\).

12.5.2 Standard Deviation

Definition 12.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\).
  • Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by (12.1)
    • \(\sqrt{\frac{N - n}{N-1}}\) is commonly referred to as the finite population correction factor. With a large population, it approaches 1.
    • Thus, \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\) becomes good approximation when the sample size is less than or equal to 5% of the population size; that is, \(n/N \leq 0.05\).
    • To further emphasize the difference between \(\sigma_{\overline{x}}\) and \({\sigma}\), we refer to the standard deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\), as the standard error of the mean.
    • (Sampling Fluctuation) The standard error of the mean is helpful in determining how far the sample mean may be from the population mean.

\[\begin{align} \text{Finite Population:} \sigma_{\overline{x}} &= \sqrt{\frac{N - n}{N-1}}\left(\frac{\sigma}{\sqrt{n}} \right) \\ \text{Infinite Population:} \sigma_{\overline{x}} &= \frac{\sigma}{\sqrt{n}} \end{align} \tag{12.1}\]
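Equation (12.1) can be sketched directly; with illustrative \(\sigma = 10\) and \(n = 30\), the correction factor approaches 1 as N grows, so the infinite-population formula becomes a good approximation:

```r
# #Standard Error with Finite Population Correction (Equation 12.1)
sigma_ff <- 10  #illustrative population standard deviation
n_ff <- 30      #illustrative sample size
se_inf <- sigma_ff / sqrt(n_ff)                #infinite-population SE
fpc <- function(N) sqrt((N - n_ff) / (N - 1))  #correction factor
round(fpc(c(100, 1000, 100000)), 4)
## [1] 0.8409 0.9854 0.9999
round(c(fpc(1000) * se_inf, se_inf), 4)  #nearly equal when n/N <= 0.05
## [1] 1.7990 1.8257
```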

Definition 12.14 A sampling error is the difference between a population parameter and a sample statistic.
  • Standard error is a measure of sampling error. There are others, but standard error is, by far, the most commonly used.
    • However, sampling error is NOT the only reason for a difference between the survey estimate and the true value in the population.
    • Another, and arguably more important, reason for this difference is bias.
      • Bias can be introduced when designing the sampling scheme.
      • Most forms of bias cannot be calculated nor measured after the data are collected, and are, therefore, often invisible.
      • Bias must be avoided by using correct procedures at each step of the survey process.
      • Bias has NOTHING to do with sample size which affects only sampling error and standard error.
      • As a result, large sample sizes do NOT eliminate bias. In fact, the larger sample size may increase the likelihood of bias in the data collection.

Refer Effect of Sample Size and Repeat Sampling


Figure 12.1 Effect of Sample Size vs Repeat Sampling

12.6 Synopsis

“ForLater”

If a statistically independent sample of \({n}\) observations \({x_1,x_2,\ldots,x_n}\) is taken from a statistical population with a standard deviation of \(\sigma\), then the mean value calculated from the sample \(\overline{x}\) will have an associated standard error of the mean \(\sigma_\overline{x}\) given by

\[\sigma_\overline{x} = \frac{\sigma}{\sqrt{n}} \tag{12.2}\]

The standard deviation \(\sigma\) of the population being sampled is seldom known. Therefore, \(\sigma_\overline{x}\) is usually estimated by replacing \(\sigma\) with the sample standard deviation \(\sigma_{x}\) instead:

\[\sigma_\overline{x} \approx \frac{\sigma_{x}}{\sqrt{n}} \tag{12.3}\]

As this is only an estimator for the true standard error, other notations are used, such as:

\[\widehat{\sigma}_\overline{x} = \frac{\sigma_{x}}{\sqrt{n}} \tag{12.4}\]

OR:

\[{s}_\overline{x} = \frac{s}{\sqrt{n}} \tag{12.5}\]

Key:

  • \(\sigma\) : Standard deviation of the population
  • \(\sigma_{x}\) : Standard deviation of the sample
  • \(\sigma_\overline{x}\) : Standard deviation of the mean
    • the standard error
  • \(\widehat{\sigma}_\overline{x}\) : Estimator of the standard deviation of the mean
    • the most often calculated quantity
    • also often colloquially called the standard error
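As a small sketch (simulated data, not from the text; the distribution parameters and seed are arbitrary assumptions), the estimator in (12.5) is just the sample standard deviation divided by \(\sqrt{n}\):

```r
# #Estimated standard error of the mean, s / sqrt(n)  (12.5)
# #Illustrative simulated sample; seed and parameters are arbitrary assumptions
set.seed(7)
x <- rnorm(30, mean = 50, sd = 10)
s <- sd(x)                       # #sample standard deviation
se_hat <- s / sqrt(length(x))    # #estimated standard error of the mean
round(c(s = s, se_hat = se_hat), 3)
```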

Non-mathematical view:

  • The SD (standard deviation) quantifies scatter — how much the values vary from one another.
  • The SEM (standard error of the mean) quantifies how precisely you know the true mean of the population.
    • It takes into account both the value of the SD and the sample size.
  • Both SD and SEM are in the same units i.e. the units of the data (in contrast, variance has squared units).
  • The SEM, by definition, is always smaller than the SD (for \(n > 1\)), since it is the SD divided by \(\sqrt{n}\).
    • The SEM gets smaller as your samples get larger.
    • So, the mean of a large sample is likely to be closer to the true population mean than is the mean of a small sample.
    • With a huge sample, you will know the value of the mean with a lot of precision even if the data is scattered.
  • The SD does not change predictably as you acquire more data.
    • The SD you compute from a sample is the best possible estimate of the SD of the overall population.
    • As you collect more data, you will assess the SD of the population with more precision. But you can not predict whether the SD from a larger sample will be bigger or smaller than the SD from a small sample.
    • Technically, variance does not change predictably. Above is a simplification. For details, see Difference between SE and SD
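The contrast above can be seen in a short simulation; a sketch where the population \(\mathcal{N}(100, 15)\) and the sample sizes are arbitrary assumptions:

```r
# #SD stays roughly constant while SEM shrinks as n grows (simulation sketch)
# #Population N(100, 15) and the sample sizes are arbitrary assumptions
set.seed(42)
for (n in c(30, 300, 3000)) {
  x <- rnorm(n, mean = 100, sd = 15)
  cat(sprintf("n = %4d | SD = %6.2f | SEM = %5.2f\n", n, sd(x), sd(x) / sqrt(n)))
}
```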

12.6.1 Form

Form of the Sampling Distribution of \({\overline{x}}\)

  • When the population has a normal distribution, the sampling distribution of \({\overline{x}}\) is normally distributed for any sample size.

  • When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of \({\overline{x}}\).

Definition 12.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.

How large does the sample size need to be before the central limit theorem applies and we can assume that the shape of the sampling distribution is approximately normal?

  • For most applications, the sampling distribution of \({\overline{x}}\) can be approximated by a normal distribution whenever the sample is size 30 or more.
  • In cases where the population is highly skewed or outliers are present, samples of size 50 may be needed.
  • Finally, if the population is discrete, the sample size needed for a normal approximation often depends on the population proportion.
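The theorem can be checked empirically; a sketch that draws many samples of size \(n = 30\) from a skewed exponential population (the rate and replication count are arbitrary assumptions):

```r
# #CLT sketch: means of n = 30 samples from a skewed exponential population
# #rate = 1 and the replication count are arbitrary assumptions
set.seed(1)
xbar <- replicate(5000, mean(rexp(30, rate = 1)))
round(c(pop_mean = 1, mean_of_xbar = mean(xbar),
        theoretical_se = 1 / sqrt(30), sd_of_xbar = sd(xbar)), 3)
```

Despite the heavily skewed population, a histogram of `xbar` is already close to normal, with mean near \(\mu = 1\) and spread near \(\sigma/\sqrt{n}\).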

Ex: EAI

The task is to develop a profile of EAI's 2500 managers. The characteristics to be identified include the mean annual salary and the proportion of managers who have completed a training program.

  • Population
    • Population Size N = 2500 managers
    • Training: 1500/2500 managers have completed Training
    • Salary: \({\mathcal {N}}_{(\mu = 51800,\, \sigma =4000)}\)
    • Proportion of the population that completed the training program \(p = \frac{1500}{2500} = 0.60\)
  • Suppose that a sample of 30 managers will be used, i.e. \({n=30}\), with 19 Yes responses for Training
    • Suppose the sample yields \(\overline{x} = 51814\) and \(s = 3348\)
    • Also, \(\overline{p} = \frac{x}{n} = \frac{19}{30} = 0.63\)
  • If 500 such samples are taken, where each have their own \({\overline{x}}\)
    • Then their expected value \(E(\overline{x}) = \mu = 51800\)
    • Standard Error \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4000}{\sqrt{30}} = 730.3\)
  • Suppose the director believes the sample mean \({\overline{x}}\) will be an acceptable estimate of the population mean \({\mu}\) if the sample mean is within $500 of the population mean.
    • However, it is not possible to guarantee that the sample mean will be within $500 of the population mean
    • We can reframe the request in probability terms i.e.
      • What is the probability that the sample mean computed using a simple random sample of 30 EAI managers will be within $500 of the population mean
      • i.e. Probability that \(\overline{x} \in [51300, 52300]\)
        • For \(z = \frac{\overline{x} - \mu}{\sigma_{\overline{x}}}\)
        • For \(\overline{x} = 52300 \Rightarrow z = \frac{52300 - 51800}{730.30} = 0.68\)
        • For \(\overline{x} = 51300 \Rightarrow z = \frac{51300 - 51800}{730.30} = -0.68\)
    • \(P_{(51300 \leq \overline{x} \leq 52300)} = P_{(\overline{x} \leq 52300)} - P_{(\overline{x} \leq 51300)} = P_{(z \leq 0.68)} - P_{(z \leq -0.68)} = 0.7517 - 0.2483 = 0.5034\)
      • A simple random sample of 30 EAI managers has a 0.5034 probability of providing a sample mean \({\overline{x}}\) that is within $500 of the population mean.
        • Thus, there is a \(1 − 0.5034 = 0.4966\) probability that the difference between \({\overline{x}}\) and \({\mu}\) will be more than $500.
        • In other words, a simple random sample of 30 EAI managers has roughly a 50–50 chance of providing a sample mean within the allowable $500. Perhaps a larger sample size should be considered.
        • Let us explore this possibility by considering the relationship between the sample size and the sampling distribution of \({\overline{x}}\).
  • Impact of \(n = 100\) in place of \(n =30\)
    • First note that \(E(\overline{x}) = \mu\) regardless of the sample size. Thus, the mean of all possible values of \({\overline{x}}\) is equal to the population mean \({\mu}\) regardless of the sample size \({n}\).
    • However, standard error is reduced to \(\sigma_{\overline{x}} = \frac{4000}{\sqrt{100}} = 400\)
    • For \(\overline{x} = 52300 \Rightarrow z = \frac{52300 - 51800}{400} = 1.25\)
    • For \(\overline{x} = 51300 \Rightarrow z = \frac{51300 - 51800}{400} = -1.25\)
    • Thus \(P_{(51300 \leq \overline{x} \leq 52300)} = P_{(\overline{x} \leq 52300)} - P_{(\overline{x} \leq 51300)} = P_{(z \leq 1.25)} - P_{(z \leq -1.25)} = 0.8944 - 0.1056 = 0.7888\)
    • Thus, by increasing the sample size from 30 to 100 EAI managers, we increase the probability of obtaining a sample mean within $500 of the population mean from 0.5034 to 0.7888.
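The two probabilities above can be reproduced with `pnorm()`; a sketch (R's fourth decimals differ slightly from the table-based 0.5034 and 0.7888 because z is not rounded to two places):

```r
# #P(51300 <= xbar <= 52300) for the EAI example, n = 30 vs n = 100
mu <- 51800; sigma <- 4000
for (n in c(30, 100)) {
  se <- sigma / sqrt(n)
  p <- pnorm(52300, mean = mu, sd = se) - pnorm(51300, mean = mu, sd = se)
  cat(sprintf("n = %3d | SE = %6.1f | P = %.4f\n", n, se, p))
}
```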

Caution: Here, we took advantage of the fact that the population mean \({\mu}\) and the population standard deviation \({\sigma}\) were known. However, usually these values will be unknown.

“ForLater”

Properties of Point Estimators

Three properties of good point estimators: unbiasedness, efficiency, and consistency.

\(\theta = \text{the population parameter of interest}\) \(\hat{\theta} = \text{the sample statistic or point estimator of } \theta\)

  • Unbiased
    • If the expected value of the sample statistic is equal to the population parameter being estimated, the sample statistic is said to be an unbiased estimator of the population parameter
  • Efficiency
    • When sampling from a normal population, the standard error of the sample mean is less than the standard error of the sample median. Thus, the sample mean is more efficient than the sample median.
  • Consistency
    • A point estimator is consistent if the values of the point estimator tend to become closer to the population parameter as the sample size becomes larger.
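The efficiency claim (sample mean beats sample median under normality) can be checked by simulation; a sketch with arbitrary assumed parameters:

```r
# #Efficiency sketch: sampling SD of the mean vs the median, normal population
set.seed(99)
means   <- replicate(2000, mean(rnorm(25, mean = 50, sd = 10)))
medians <- replicate(2000, median(rnorm(25, mean = 50, sd = 10)))
round(c(se_mean = sd(means), se_median = sd(medians)), 3)
# #se_mean is close to 10 / sqrt(25) = 2; se_median is roughly 1.25x larger
```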

Other Sampling Methods

  • Stratified Random Sampling
  • Cluster Sampling
  • Systematic Sampling
  • Convenience Sampling
  • Judgment Sampling

Validation


13 Interval Estimation

13.1 Overview

13.2 Interval Estimate

Definition 13.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{Margin of Error}\)
Definition 13.2 Confidence interval is another name for an interval estimate. Normally it is given as \((1 - \alpha)\). Ex: 95% confidence interval
Definition 13.3 The confidence level expressed as a decimal value is the confidence coefficient (\(1-{\alpha}\)). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

Known SD

In order to develop an interval estimate of a population mean, either the population standard deviation \({\sigma}\) or the sample standard deviation \({s}\) must be used to compute the margin of error. In most applications \({\sigma}\) is not known, and \({s}\) is used to compute the margin of error.

In some applications, large amounts of relevant historical data are available and can be used to estimate the population standard deviation prior to sampling. Also, in quality control applications where a process is assumed to be operating correctly, or ‘in control,’ it is appropriate to treat the population standard deviation as known.

Sampling distribution of \({\overline{x}}\) can be used to compute the probability that \({\overline{x}}\) will be within a given distance of \({\mu}\).

Example: Lloyd Department Store

  • Each week Lloyd Department Store selects a simple random sample of 100 customers in order to learn about the amount spent per shopping trip.
    • With \({x}\) representing the amount spent per shopping trip, the sample mean \({\overline{x}}\) provides a point estimate of \({\mu}\), the mean amount spent per shopping trip for the population of all Lloyd customers. Based on the historical data, Lloyd now assumes a known value of \(\sigma = 20\) for the population standard deviation.
    • During the most recent week, Lloyd surveyed 100 customers \((n = 100)\) and obtained a sample mean of \(\overline{x} = 82\).
    • we can conclude that the sampling distribution of \({\overline{x}}\) follows a normal distribution with a standard error of \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{100}} =2\).
    • Because the sampling distribution shows how values of \({\overline{x}}\) are distributed around the population mean \({\mu}\), the sampling distribution of \({\overline{x}}\) provides information about the possible differences between \({\overline{x}}\) and \({\mu}\).
    • Using the standard normal probability table, we find that 95% of the values of any normally distributed random variable are within \(\pm 1.96\) standard deviations of the mean i.e. \([\mu - 1.96 \sigma, \mu + 1.96\sigma]\).
      • Thus, 95% of the \({\overline{x}}\) values must be within \(\pm 1.96 \sigma_{\overline{x}}\) of the mean \({\mu}\).
      • In the Lloyd example we know that the sampling distribution of \({\overline{x}}\) is normally distributed with a standard error of \(\sigma_{\overline{x}} =2\).
      • we can conclude that 95% of all \({\overline{x}}\) values obtained using a sample size of \(n = 100\) will be within \((\pm 1.96 \times 2 = \pm 3.92)\) of the population mean \({\mu}\).
    • As given above, sample mean was \(\overline{x} = 82\)
      • Interval estimate of \(\overline{x} = 82 \pm 3.92 = [78.08, 85.92]\)
      • Because 95% of all the intervals constructed using \(\overline{x} = 82 \pm 3.92\) will contain the population mean, we say that we are 95% confident that the interval 78.08 to 85.92 includes the population mean \({\mu}\).
      • We say that this interval has been established at the 95% confidence level.
      • The value 0.95 is referred to as the confidence coefficient, and the interval 78.08 to 85.92 is called the 95% confidence interval.
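The Lloyd interval can be computed directly in R; a minimal sketch using the values stated above:

```r
# #Lloyd example: 95% interval estimate of mu with sigma known  (13.1)
xbar <- 82; sigma <- 20; n <- 100
z <- qnorm(0.975)                  # #z_{alpha/2} = 1.96 for 95% confidence
me <- z * sigma / sqrt(n)          # #margin of error
round(c(lower = xbar - me, upper = xbar + me), 2)
```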

Interval Estimate of a Population Mean: \({\sigma}\) known is given by equation (13.1)

\[\begin{align} \overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \end{align} \tag{13.1}\]

where \((1 − \alpha)\) is the confidence coefficient and \(z_{\alpha/2}\) is the z-value providing an area of \(\alpha/2\) in the upper tail of the standard normal probability distribution.

For a 95% confidence interval, the confidence coefficient is \((1 − \alpha) = 0.95\) and thus, \(\alpha = 0.05\). Using the standard normal probability table, an area of \(\alpha/2 = 0.05/2 = 0.025\) in the upper tail provides \(z_{.025} = 1.96\).

# #Find z-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
#
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.2816 1.6449 1.9600 2.3263 2.5758

13.3 Unknown SD

Definition 13.4 When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.

The t distribution is a family of similar probability distributions, with a specific t distribution depending on a parameter known as the degrees of freedom. As the number of degrees of freedom increases, the difference between the t distribution and the standard normal distribution becomes smaller and smaller.

Just as \(z_{0.025}\) was used to indicate the z value providing a 0.025 area in the upper tail of a standard normal distribution, \(t_{0.025}\) indicates a 0.025 area in the upper tail of a t distribution. In general, the notation \(t_{\alpha/2}\) represents a t value with an area of \(\alpha/2\) in the upper tail of the t distribution.

As the degrees of freedom increase, the t distribution approaches the standard normal distribution. Ex: \(t_{0.025} = 2.262 \,(\text{DOF} = 9)\), \(t_{0.025} = 2.000 \,(\text{DOF} = 60)\), and \(t_{0.025} = 1.96 \,(\text{DOF} = \infty) = z_{0.025}\)

Interval Estimate of a Population Mean: \({\sigma}\) Unknown is given by equation (13.2)

\[\begin{align} \overline{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \end{align} \tag{13.2}\]

where \({s}\) is the sample standard deviation, \((1 − \alpha)\) is the confidence coefficient and \(t_{\alpha/2}\) is the t-value providing an area of \(\alpha/2\) in the upper tail of the t distribution with \({n-1}\) degrees of freedom.

Refer equation (8.12), the expression for the sample standard deviation is

\[{s} = \sqrt{\frac{\sum \left(x_i - \bar{x}\right)^2}{n-1}}\]

Definition 13.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).

Why \((n-1)\) are the degrees of freedom

  • Degrees of freedom refer to the number of independent pieces of information that go into the computation. i.e. \(\{(x_{1}-\overline{x}), (x_{2}-\overline{x}), \ldots, (x_{n}-\overline{x})\}\)
  • However, \(\sum (x_{i}-\overline{x}) = 0\) for any data set.
  • Thus, only \((n − 1)\) of the \((x_{i}-\overline{x})\) values are independent.
    • if we know \((n − 1)\) of the values, the remaining value can be determined exactly by using the condition.
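The constraint above is easy to verify numerically; a sketch with arbitrary illustrative data:

```r
# #Deviations from the mean sum to zero, so only n-1 of them are free to vary
x <- c(4, 8, 15, 16, 23, 42)      # #arbitrary illustrative data, mean = 18
d <- x - mean(x)
sum(d)                             # #0: the deviations always cancel out
-sum(d[1:5]) == d[6]               # #TRUE: the last deviation is forced by the rest
```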

Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.

Cal T

# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9 
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)

#
dof_ii <- 60
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.0003 (dof = 60)
#
dof_ii <- 600
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9639 (dof = 600)
#
# #t-table have Infinity Row which is same as z-table. For DOF >100, it can be used.
dof_ii <- Inf
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.96 (dof = Inf)

#
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96

Ex: Credit Card

# #A sample of n = 70 households provided the credit card balances.
xxCreditCards <- c(9430, 7535, 4078, 5604, 5179, 4416, 10676, 1627, 10112, 6567, 13627, 18719, 14661, 12195, 10544, 13659, 7061, 6245, 13021, 9719, 2200, 10746, 12744, 5742, 7159, 8137, 9467, 12595, 7917, 11346, 12806, 4972, 11356, 7117, 9465, 19263, 9071, 3603, 16804, 13479, 14044, 6817, 6845, 10493, 615, 13627, 12557, 6232, 9691, 11448, 8279, 5649, 11298, 4353, 3467, 6191, 12851, 5337, 8372, 7445, 11032, 6525, 5239, 6195, 12584, 15415, 15917, 12591, 9743, 10324)
f_setRDS(xxCreditCards)
bb <- f_getRDS(xxCreditCards)
mean_bb <- mean(bb)
sd_bb <- sd(bb)
dof_bb <- length(bb) - 1L
# #t-value for confidence interval 95% | (1-alpha) = 0.95 | alpha = 0.05 | alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
dof_ii <- dof_bb
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9949 (dof = 69)
#
# #Interval Estimate
err_margin_bb <- t_ii * sd_bb / sqrt(length(bb))
est_l <- mean_bb - err_margin_bb
est_r <- mean_bb + err_margin_bb
#
cat(paste0("Normal Sample (n=", length(bb), ", mean=", mean_bb, ", sd=", round(sd_bb, 1),
           "):\n Point Estimate = ", mean_bb, ", Margin of error = ", round(err_margin_bb, 1), 
           ", ", (1-2*p_r_ii) * 100, "% confidence interval is [", 
           round(est_l, 1), ", ", round(est_r, 1), "]"))
## Normal Sample (n=70, mean=9312, sd=4007):
##  Point Estimate = 9312, Margin of error = 955.4, 95% confidence interval is [8356.6, 10267.4]
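The manual computation above can be cross-checked against `t.test()`, which returns the same t-based interval of (13.2); a self-contained sketch using only the first 10 balances from the sample above:

```r
# #t.test() computes the t-based interval of (13.2) directly
x <- c(9430, 7535, 4078, 5604, 5179, 4416, 10676, 1627, 10112, 6567)
tt <- t.test(x, conf.level = 0.95)
round(tt$conf.int, 1)     # #same as mean(x) -+ qt(0.975, 9) * sd(x) / sqrt(10)
```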

“ForLater”

  • Determining the Sample Size
  • Population Proportion

Validation


14 Hypothesis Tests

14.1 Overview

14.2 Hypothesis Testing

Definition 14.1 Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

Note:

  • Hypothesis testing is used to determine whether a statement about the value of a population parameter should or should not be rejected.
  • It is the process of checking whether the sample information is consistent with the population information.
  • The hypothesis testing procedure uses data from a sample to test the two competing statements indicated by \({H_0}\) and \({H_a}\)
Definition 14.2 Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.
Definition 14.3 Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

Note:

  • The conclusion that the alternative hypothesis \((H_a)\) is true is made if the sample data provide sufficient evidence to show that the null hypothesis \((H_0)\) can be rejected.
  • The null and alternative hypotheses are competing statements about the population. Either the null hypothesis \({H_0}\) is true or the alternative hypothesis \({H_a}\) is true, but not both.

14.3 Developing Null and Alternative Hypotheses

All hypothesis testing applications involve collecting a sample and using the sample results to provide evidence for drawing a conclusion.

In some situations it is easier to identify the alternative hypothesis first and then develop the null hypothesis.

  • The Alternative Hypothesis as a Research Hypothesis
    • A new fuel injection system designed to increase the miles-per-gallon rating from the current value 24 miles per gallon.
      • \(H_a : \mu > 24 \iff H_0: \mu \leq 24\)
    • A new teaching method is developed that is believed to be better than the current method.
      • \(H_a : \text{\{New method is better\}} \iff H_0: \text{\{New method is NOT better\}}\)
    • A new sales force bonus plan is developed in an attempt to increase sales.
      • \(H_a : \text{\{New plan increases sales\}} \iff H_0: \text{\{New plan does not increase sales\}}\)
    • A new drug is developed with the goal of lowering blood pressure more than an existing drug.
      • \({H_a}\) : New drug lowers blood pressure more than the existing drug
      • \({H_0}\) : New drug does not provide lower blood pressure than the existing drug
    • In each case, rejection of the null hypothesis \({H_0}\) provides statistical support for the research hypothesis \({H_a}\).
  • The Null Hypothesis as an Assumption to Be Challenged
    • The null hypothesis \({H_0}\) expresses the belief or assumption about the value of the population parameter. The alternative hypothesis \({H_a}\) is that the belief or assumption is incorrect.
    • Ex: The label on a soft drink bottle states that it contains 67.6 fluid ounces.
      • We consider the label correct provided the population mean filling weight for the bottles is at least 67.6 fluid ounces.
      • Without any reason to believe otherwise, we would give the manufacturer the benefit of the doubt and assume that the statement provided on the label is correct.
      • \(H_0 : \mu \geq 67.6 \iff H_a: \mu < 67.6\)
      • If the sample results lead to the conclusion to reject \({H_0}\), the inference that \(H_a: \mu < 67.6\) is true can be made. With this statistical support, the agency is justified in concluding that the label is incorrect and underfilling of the bottles is occurring. Appropriate action to force the manufacturer to comply with labeling standards would be considered.
      • However, if the sample results indicate \({H_0}\) cannot be rejected, the assumption that the labeling is correct cannot be rejected. With this conclusion, no action would be taken.
      • Product information is usually assumed to be true and stated as the null hypothesis. The conclusion that the information is incorrect can be made if the null hypothesis is rejected.
    • Same situation, from the point of view of the manufacturer
      • The company does not want to underfill the containers (legal requirement). However, the company does not want to overfill containers either because it would be an unnecessary cost.
      • \(H_0 : \mu = 67.6 \iff H_a: \mu \neq 67.6\)
      • If the sample results lead to the conclusion to reject \({H_0}\), the inference is made that \(H_a: \mu \neq 67.6\) is true. We conclude that the bottles are not being filled properly and the production process should be adjusted.
      • However, if the sample results indicate \({H_0}\) cannot be rejected, the assumption that the process is functioning properly cannot be rejected. In this case, no further action would be taken.

14.4 Three forms of hypotheses

For hypothesis tests involving a population mean, we let \({\mu}_0\) denote the hypothesized value and we must choose one of the following three forms for the hypothesis test.

Alternative is One-Sided, if it states that a parameter is larger or smaller than the null value. Alternative is Two-sided, if it states that the parameter is different from the null value.

Definition 14.4 \(\text{\{Left Tail or Lower Tail\} } {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
Definition 14.5 \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Definition 14.6 \(\text{\{Two Tail\} } {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

Notes

  • The equality part of the expression \(\{\mu \geq \mu_0 \, | \, \mu \leq \mu_0 \, | \, \mu = \mu_0\}\) always appears in the null hypothesis \({H_0}\).
  • The alternative hypothesis is often what the test is attempting to establish.
    • Hence, asking whether the user is looking for evidence to support \(\{\mu < \mu_0 \, | \, \mu > \mu_0 \, | \, \mu \neq \mu_0\}\) will help determine \({H_a}\)

Exercises

  • The manager of an automobile dealership is considering a new bonus plan designed to increase sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager wants to conduct a research study to see whether the new bonus plan increases sales volume.
    • Solution: \(H_0 : \mu \leq 14 \iff H_a: \mu > 14\)
  • A director of manufacturing must convince management that a proposed manufacturing method reduces costs before the new method can be implemented. The current production method operates with a mean cost of $220 per hour.
    • Solution: \(H_0 : \mu \geq 220 \iff H_a: \mu < 220\)

14.5 Type I and Type II Errors

Refer Type I and Type II Errors (B12)

Ideally the hypothesis testing procedure should lead to the acceptance of \({H_0}\) when \({H_0}\) is true and the rejection of \({H_0}\) when \({H_a}\) is true. Unfortunately, the correct conclusions are not always possible. Because hypothesis tests are based on sample information, we must allow for the possibility of errors.


Figure 14.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Definition 14.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
Definition 14.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
Definition 14.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

Recall Definition 13.3: The confidence level expressed as a decimal value is the confidence coefficient (\(1-{\alpha}\)). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.

  • The most common values are \({\alpha} = 0.05\) and \(0.01\).
    • For example, a significance level of \({\alpha} = 0.05\) indicates a 5% risk of concluding that a difference exists when there is no actual difference.
    • Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.
    • If the cost of making a Type I error is high, small values of \({\alpha}\) are preferred. Ex: \({\alpha} = 0.01\)
    • If the cost of making a Type I error is not too high, larger values of \({\alpha}\) are typically used. Ex: \(\alpha = 0.05\)
Definition 14.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Hence, if we decide to accept \({H_0}\), we cannot determine how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement “do not reject \({H_0}\)” instead of “accept \({H_0}\).” Using the statement “do not reject \({H_0}\)” carries the recommendation to withhold both judgment and action. In effect, by not directly accepting \({H_0}\), the statistician avoids the risk of making a Type II error.

14.5.1 Additional

Refer figure 14.1

  1. Type I (\({\alpha}\)):
    • False Positive: Rejecting a True \({H_0}\) thus claiming False \({H_a}\)
    • An alpha error is when you mistakenly reject the Null and believe that something significant happened
      • i.e. you believe that the means of the two populations are different when they are not
      • i.e. you report that your findings are significant when in fact they have occurred by chance
    • The probability of making a Type I error is represented by the alpha level \({\alpha}\), the threshold below which a p-value leads you to reject the null hypothesis
      • The p-value is the actual risk you have of being wrong if you reject the null
        • You would like that risk to be low
        • The p-value is compared with, and must be lower than, the alpha to reject
        • An alpha of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. You can reduce your risk of committing a Type I error by using a lower alpha. For example, \({\alpha} = 0.01\) would mean there is a 1% chance of committing a Type I error.
        • However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a Type II error).
    • \({\alpha}\) is Significance Level (for \((1-{\alpha})\) confidence of not committing Type 1 error)
      • It is the boundary for specifying a statistically significant finding when interpreting the p-value
    • NOTE: Fail to reject True \({H_0}\) (\(\approx\) accept) is the correct decision shown in Top Left Quadrant
  2. Type II (\({\beta}\)):
    • False Negative: Failing to reject (\(\approx\) accept) a False \({H_0}\)
    • A beta error is when you fail to reject the null when you should have
      • i.e. you missed something significant and failed to take action
      • i.e. you conclude that there is not a significant effect, when actually there really is
      • You can decrease your risk of committing a type II error by ensuring your test has enough power.
      • You can do this by ensuring your sample size is large enough to detect a practical difference when one truly exists.

14.23 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

  • The consequences of making a type I error mean that changes or interventions are made which are unnecessary, and thus waste time, resources, etc.
  • Type II errors typically lead to the preservation of the status quo (i.e. interventions remain the same) when change is needed.
  • As a rule of thumb, at most a 5% \({\alpha}\) error rate and at most a 20% \({\beta}\) error rate are recommended
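The trade-off between the two error rates can be illustrated with a small simulation; the population parameters below (mu0, sigma, n, and the alternative mean) are assumed purely for illustration.

```r
# #Sketch (assumed values): estimate Type I and Type II error rates by simulation
set.seed(1)
alpha <- 0.05; mu0 <- 100; sigma <- 15; n <- 30
reject <- function(true_mu) {
  xbar <- mean(rnorm(n, mean = true_mu, sd = sigma))
  z <- (xbar - mu0) / (sigma / sqrt(n))
  abs(z) >= qnorm(1 - alpha / 2)  # #two-tailed rejection rule
}
type1 <- mean(replicate(5000, reject(100)))   # #H0 true: rejection rate ~ alpha
type2 <- mean(!replicate(5000, reject(110)))  # #H0 false: miss rate = beta
```

With the assumed effect size, the simulated Type I rate hovers near \({\alpha}\), while \({\beta}\) depends on how far the true mean is from \({\mu}_0\).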

14.6 Known SD

14.6.1 Test Statistic

Definition 14.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

10.4 The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.

The test statistic summarizes the observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in the statistical model. Refer Table 14.1

Table 14.1: (C09V01) Test Statistic

  • t-value
    • \({H_0}\): The means of two groups are equal
    • \({H_a}\): The means of two groups are not equal
    • Statistical tests that use it: T-test, Regression tests
  • z-value
    • \({H_0}\): The means of two groups are equal
    • \({H_a}\): The means of two groups are not equal
    • Statistical tests that use it: Z-test
  • F-value
    • \({H_0}\): The variation among two or more groups is greater than or equal to the variation between the groups
    • \({H_a}\): The variation among two or more groups is smaller than the variation between the groups
    • Statistical tests that use it: ANOVA, ANCOVA, MANOVA
  • \({\chi}^2\text{-value}\)
    • \({H_0}\): Two samples are independent
    • \({H_a}\): Two samples are not independent (i.e. they are correlated)
    • Statistical tests that use it: Chi-squared test, Non-parametric correlation tests

14.6.2 Tails

8.14 A tail refers to the tapering sides at either end of a distribution curve.

Definition 14.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

One-tailed tests are concerned with one side of a statistic. Thus, one-tailed tests deal with only one tail of the distribution, and the z-score is on only one side of the statistic, whereas two-tailed tests deal with both tails of the distribution, and the z-score is on both sides of the statistic.

In a one-tailed test, the area under the rejection region is equal to the level of significance, \({\alpha}\). When the rejection region is below the acceptance region, we say that it is a left-tail test. Similarly, when the rejection region is above the acceptance region, we say that it is a right-tail test.

In the two-tailed test, there are two critical regions, and the area under each region is \(\frac{\alpha}{2}\).

One-Tail vs. Two-Tail

  • One-tailed tests have more statistical power to detect an effect in one direction than a two-tailed test with the same design and significance level.
    • One-tailed tests occur most frequently for studies where one of the following is true:
      • Effects can exist in only one direction.
      • Effects can exist in both directions but the researchers only care about an effect in one direction.
  • The disadvantage of one-tailed tests is that they have no statistical power to detect an effect in the other direction.
    • A two-tailed hypothesis test, in contrast, is designed to show whether the sample mean is significantly greater than OR significantly less than the mean of a population.
      • A two-tailed test is designed to examine both sides of a specified data range as designated by the probability distribution involved.
  • Thumb rule
    • Consider both directions when deciding if you should run a one tailed test or two. If you can skip one tail and it is not irresponsible or unethical to do so, then you can run a one-tailed test.
    • Two-tail test is done when you do not know about direction, so you test for both sides.

14.6.3 One-tailed Test

One-tailed tests about a population mean take one of the following two forms:

14.4 \(\text{\{Left Tail or Lower Tail\} } {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

14.5 \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

Definition 14.13 One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.

Example: The label on a can of Hilltop Coffee states that the can contains 3 pounds of coffee. As long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the government (FTC) interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can.

  • Develop the null and alternative hypotheses for the test
    • \(H_0 : \mu \geq 3 \iff H_a: \mu < 3\)
  • Take a Sample
    • Suppose a sample of 36 cans of coffee is selected and the sample mean \({\overline{x}}\) is computed as an estimate of the population mean \({\mu}\). If the value of the sample mean \({\overline{x}}\) is less than 3 pounds, the sample results will cast doubt on the null hypothesis.
    • What we want to know is how much less than 3 pounds must \({\overline{x}}\) be before we would be willing to declare the difference significant and risk making a Type I error by falsely accusing Hilltop of a label violation. A key factor in addressing this issue is the value the decision maker selects for the level of significance.
  • Specify the level of significance \({\alpha}\)
    • FTC is willing to risk a 1% chance of making such an error i.e. \(\alpha = 0.01\)
  • Compute the value of test statistic
    • Assume, known \(\sigma = 0.18\) and Normal distribution
    • Refer equation (12.1), standard error of \({\overline{x}}\) is \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.18}{\sqrt{36}} = 0.03\)
    • Because the sampling distribution of \({\overline{x}}\) is normally distributed, \(z = \frac{\overline{x} - \mu_0}{\sigma_{\overline{x}}} = \frac{\overline{x} - 3}{0.03}\)
    • Because the sampling distribution of \({\overline{x}}\) is normally distributed, the sampling distribution of \({z}\) is a standard normal distribution.
    • A value of \(z = −1\) means that the value of \({\overline{x}}\) is one standard error below the hypothesized value of the mean. For a value of \(z = −2\), it would be two standard errors below the mean, and so on.
    • We can use the standard normal probability table to find the lower tail probability \({P_{\left(z\right)}}\) corresponding to any \({z}\) value. Refer Calculate P(z) by pnorm()
      • Ex: \(P_{\left(z = -3\right)} = 0.0013\)
      • As a result, the probability of obtaining a value of \({\overline{x}}\) that is 3 or more standard errors below the hypothesized population mean \(\mu_0=3\) is also 0.0013. i.e. Such a result is unlikely if the null hypothesis is true.
    • For hypothesis tests about a population mean in the \({\sigma}\) known case, we use the standard normal random variable \({z}\) as a test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. As given in equation (14.1)

\[z = \frac{\overline{x} - \mu_0}{\sigma_{\overline{x}}} = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}} \tag{14.1}\]

The key question for a lower tail test is: How small must the test statistic \({z}\) be before we choose to reject the null hypothesis?

Two approaches can be used to answer this: the p-value approach and the critical value approach.

14.6.3.1 p-value approach

Definition 14.14 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
Definition 14.15 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

p-value (p) is the probability of obtaining a result equal to or more extreme than was observed in the data. It is the probability of observing the result given that the null hypothesis is true. A small p-value indicates the value of the test statistic is unusual given the assumption that \({H_0}\) is true.

For a lower tail test, the p-value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample.

  • We use the standard normal distribution to find the probability that \({z}\) is less than or equal to the value of the test statistic.
  • After computing the p-value, we must then decide whether it is small enough to reject the null hypothesis; this decision involves comparing the p-value to the level of significance.

For the Hilltop Coffee Example

  • Suppose the sample of 36 Hilltop coffee cans provides a sample mean of \({\overline{x}}\) = 2.92 pounds.
    • Is \(\overline{x} = 2.92\) small enough to cause us to reject \({H_0}\)?
  • Because this is a lower tail test, the p-value is the area under the standard normal curve for values of \({z}\) less than or equal to the value of the test statistic.
    • Refer equation (14.1), \(z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{2.92 - 3}{0.18/\sqrt{36}} = -2.67\)
    • Thus, the p-value is the probability that \({z}\) is less than or equal to −2.67 (the lower tail area corresponding to the value of the test statistic).
  • Refer Calculate P(z) by pnorm(), to get the p-value
    • \(P_{\left(\overline{x} = 2.92\right)} = P_{\left(z = -2.67\right)} = 0.0038\)
    • This p-value does not provide much support for the null hypothesis, but is it small enough to cause us to reject \({H_0}\)?
  • Compare p-value with Level of significance \(\alpha = 0.01\)
    • Because .0038 is less than or equal to \(\alpha = 0.01\), we reject \({H_0}\). Therefore, we find sufficient statistical evidence to reject the null hypothesis at the .01 level of significance.
    • We can conclude that Hilltop is underfilling the cans.

Rejection Rule: Reject \({H_0}\) if p-value \(\leq {\alpha}\)

Further, in this case, we would reject \({H_0}\) for any value of \({\alpha} \geq (p = 0.0038)\). For this reason, the p-value is also called the observed level of significance.
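The Hilltop p-value computation can be reproduced with pnorm(), using the figures from the example:

```r
# #Hilltop Coffee: lower tail test with known sigma
xbar <- 2.92; mu0 <- 3; sigma <- 0.18; n <- 36
z <- (xbar - mu0) / (sigma / sqrt(n))  # #test statistic, ~ -2.67
p_value <- pnorm(z)                    # #lower tail area, ~0.0038
p_value <= 0.01                        # #TRUE: reject H0 at alpha = 0.01
```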

14.6.3.2 Critical value approach

Definition 14.16 The critical value approach requires that we first determine a value for the test statistic called the critical value.
Definition 14.17 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).

For a lower tail test, the critical value serves as a benchmark for determining whether the value of the test statistic is small enough to reject the null hypothesis.

  • The critical value is the value of the test statistic that corresponds to an area of \({\alpha}\) (the level of significance) in the lower tail of the sampling distribution of the test statistic.
  • In other words, the critical value is the largest value of the test statistic that will result in the rejection of the null hypothesis.

Hilltop Coffee Example

  • The sampling distribution for the test statistic \({z}\) is a standard normal distribution.
    • Therefore, the critical value is the value of the test statistic that corresponds to an area of \(\alpha = 0.01\) in the lower tail of a standard normal distribution.
    • Using the standard normal probability table, we find that \(P_{\left(z\right)} = 0.01\) for \(z_{\alpha = 0.01} = −2.33\)
    • Refer For P(z), find z by qnorm()
    • Thus, if the sample results in a value of the test statistic that is less than or equal to −2.33, the corresponding p-value will be less than or equal to .01; in this case, we should reject the null hypothesis.
  • Compare test statistic with z-value
    • Because \((z = -2.67) < (z_{\alpha = 0.01} = −2.33)\), we can reject \({H_0}\)
    • We can conclude that Hilltop is underfilling the cans.

Rejection Rule: Reject \({H_0}\) if \(z \leq z_{\alpha}\)
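The critical value itself can be obtained with qnorm():

```r
# #Critical value: z for which the lower tail area is alpha = 0.01
z_crit <- qnorm(0.01)  # #~ -2.33
# #Hilltop test statistic falls below the critical value, so reject H0
-2.67 <= z_crit        # #TRUE
```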

14.6.3.3 Summary

The p-value approach to hypothesis testing and the critical value approach will always lead to the same rejection decision; that is, whenever the p-value is less than or equal to \({\alpha}\), the value of the test statistic will be less than or equal to the critical value.

  • The advantage of the p-value approach is that the p-value tells us how significant the results are (the observed level of significance).
    • If we use the critical value approach, we only know that the results are significant at the stated level of significance.

For an upper tail test, the test statistic \({z}\) is still computed as earlier. But the p-value is the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. Thus, to compute the p-value for the upper tail test in the \({\sigma}\) known case, we must use the standard normal distribution to find the probability that \({z}\) is greater than or equal to the value of the test statistic. Using the critical value approach, we reject the null hypothesis if the value of the test statistic is greater than or equal to the critical value \(z_{\alpha}\); in other words, we reject \({H_0}\) if \(z \geq z_{\alpha}\).

14.6.3.4 Acceptance and Rejection Region

13.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{Margin of Error}\)

Definition 14.18 An acceptance region (confidence interval) is a set of values for the test statistic for which the null hypothesis is accepted. i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.

\[Z=\frac {\overline {X}-\mu }{\sigma/{\sqrt{n}}} \quad \Leftrightarrow \mu = \overline {X} - Z \frac{\sigma}{\sqrt{n}} \quad \Rightarrow \mu = \overline {X} \pm Z \frac{\sigma}{\sqrt{n}} \quad \Rightarrow \mu \approx \overline {X} \pm Z \frac{s}{\sqrt{n}} \tag{14.2}\]

Definition 14.19 The margin of error tells how far the original population means might be from the sample mean. It is given by \(Z\frac{\sigma}{\sqrt{n}}\)
Definition 14.20 A rejection region (critical region), is a set of values for the test statistic for which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.

14.6.4 Two-tailed Test

14.6 \(\text{\{Two Tail\} } {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

Definition 14.21 Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.

Ex: Golf Company, mean driving distance is 295 yards i.e. \((\mu_0 = 295)\)

  • \(H_0 : \mu = 295 \iff H_a: \mu \neq 295\)

  • The quality control team selected \(\alpha = 0.05\) as the level of significance for the test.

  • From previous tests, assume known \(\sigma = 12\)

  • For a sample size \(n = 50\)

    • Standard Error of \({\overline{x}}\) is \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{50}} = 1.7\)
    • The Central Limit Theorem allows us to conclude that the sampling distribution of \({\overline{x}}\) can be approximated by a normal distribution.
  • Suppose for the sample, \(\overline{x} = 297.6\)

  • p-value approach

    • For a two-tailed test, the p-value is the probability of obtaining a value for the test statistic as unlikely as or more unlikely than that provided by the sample.
    • Refer equation (14.1), \(z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{297.6 - 295}{12/\sqrt{50}} = 1.53\)
    • Now to compute the p-value we must find the probability of obtaining a value for the test statistic at least as unlikely as \(z = 1.53\).
      • Clearly values of \(z \geq 1.53\) are at least as unlikely.
      • But, because this is a two-tailed test, values of \(z \leq −1.53\) are also at least as unlikely as the value of the test statistic provided by the sample.
    • Refer Calculate P(z) by pnorm(), to get the p-value
      • \(P_{\left(z\right)} = P_{\left(z \leq -1.53\right)} + P_{\left(z \geq 1.53\right)}\)
      • \(P_{\left(z\right)} = 2 \times P_{\left(z \geq 1.53\right)}\), Because the normal curve is symmetric
      • \(P_{\left(z\right)} = 2 \times 0.0630 = 0.1260\)
    • Compare p-value with Level of significance \(\alpha = 0.05\)
      • We do not reject \({H_0}\) because the \((\text{p-value}= 0.1260) > (\alpha = 0.05)\)
      • Because the null hypothesis is not rejected, no action will be taken.
  • critical value approach

    • The critical values for the test will occur in both the lower and upper tails of the standard normal distribution.
    • With a level of significance of \(\alpha = 0.05\), the area in each tail corresponding to the critical values is \(\alpha/2 = 0.025\).
    • Refer For P(z), find z by qnorm()
      • Using the standard normal probability table, we find that \(P_{\left(z\right)} = 0.025\) for \(-z_{\alpha/2 = 0.025} = −1.96\) and \(z_{\alpha/2 = 0.025} = 1.96\)
    • Compare test statistic with z-value
      • Because \((z = 1.53)\) is NOT greater than \((z_{\alpha/2 = 0.025} = 1.96)\), we can NOT reject \({H_0}\)

Rejection Rule: Reject \({H_0}\) if \(z \leq -z_{\alpha/2}\) or \(z \geq z_{\alpha/2}\)
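Using the golf example's numbers, the two-tailed p-value and critical values map onto pnorm() and qnorm() as follows:

```r
# #Golf example: two-tailed test with known sigma
xbar <- 297.6; mu0 <- 295; sigma <- 12; n <- 50
z <- (xbar - mu0) / (sigma / sqrt(n))             # #~1.53
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)  # #~0.126
z_crit <- qnorm(0.025, lower.tail = FALSE)        # #~1.96
p_value <= 0.05                                   # #FALSE: do not reject H0
```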

(Online, might be wrong) Ex: For a population with unknown mean \({\mu}\) and standard deviation \({\sigma} = 15\), suppose a sample of size \({n = 100}\) yields a sample mean \({\overline{x}} = 42\).

Assume \({\alpha} = 0.05\) and if we are conducting a Two Tail Test, \(Z_{\alpha/2=0.05/2} = 1.960\)

  • If we take a different sample of same size or a sample of different size, the sample mean calculated for those would be different.
  • So, our sample mean \({\overline{x}}\) might not be the true population mean \({\mu}\)
  • Thus, a range is inferred using the sample size, the sample mean, and the population standard deviation, and it is assumed that the true population mean falls within this interval. This interval is called a confidence interval.
  • Confidence intervals are calculated using the critical limit \({z}\), and thus correspond to a specific significance level \({\alpha}\)
  • Margin of Error \(= Z\frac{\sigma}{\sqrt{n}} = 1.96 \times 15 /\sqrt{100} = 2.94\)

As shown in the equation (14.2), our interval range is \(\mu = \overline {X} \pm 2.94 = 42 \pm 2.94 \rightarrow \mu \in (39.06, 44.94)\)

We are 95% confident that the population mean will be between 39.06 and 44.94

Note that a 95% confidence interval does not mean there is a 95% chance that the true value being estimated is in the calculated interval. Rather, given a population, there is a 95% chance that choosing a random sample from this population results in a confidence interval which contains the true value being estimated.
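The interval above can be reproduced directly from the margin of error formula:

```r
# #95% confidence interval for the (Online) example
xbar <- 42; sigma <- 15; n <- 100
z <- qnorm(0.975)          # #~1.96
me <- z * sigma / sqrt(n)  # #margin of error, ~2.94
c(xbar - me, xbar + me)    # #~39.06 44.94
```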

14.7 Steps of Hypothesis Testing

Common Steps

  1. Develop the null and alternative hypotheses.
  2. Specify the level of significance.
  3. Collect the sample data and compute the value of the test statistic.

p-Value Approach Step

  1. Use the value of the test statistic to compute the p-value.
  2. Reject \({H_0}\) if the p-value \(\leq {\alpha}\).
  3. Interpret the statistical conclusion in the context of the application.
Definition 14.22 p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret

Critical Value Approach

  1. Use the level of significance to determine the critical value and the rejection rule.
  2. Use the value of the test statistic and the rejection rule to determine whether to reject \({H_0}\).
  3. Interpret the statistical conclusion in the context of the application.
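The steps above can be wrapped in a small helper for the \({\sigma}\) known case; the function name and interface here are illustrative, not standard:

```r
# #Illustrative helper: p-value approach for a z-test with known sigma
z_test <- function(xbar, mu0, sigma, n, alpha = 0.05,
                   tail = c("lower", "upper", "two")) {
  tail <- match.arg(tail)
  z <- (xbar - mu0) / (sigma / sqrt(n))  # #Step 3: test statistic
  p <- switch(tail,                      # #Step 4: p-value for the chosen tail
              lower = pnorm(z),
              upper = pnorm(z, lower.tail = FALSE),
              two   = 2 * pnorm(abs(z), lower.tail = FALSE))
  list(z = z, p_value = p, reject_H0 = (p <= alpha))  # #Step 5: rejection rule
}
# #Hilltop Coffee example: reject H0 at alpha = 0.01
z_test(xbar = 2.92, mu0 = 3, sigma = 0.18, n = 36, alpha = 0.01, tail = "lower")
```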

14.8 Relationship Between Interval Estimation and Hypothesis Testing

Refer equation (13.1), For the \({\sigma}\) known case, the \(100{(1 - \alpha)}\%\) confidence interval estimate of a population mean is given by

\[\overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\]

We know that \(100 {(1 - \alpha)}\%\) of the confidence intervals generated will contain the population mean and \(100 {\alpha}\%\) of the confidence intervals generated will not contain the population mean.

Thus, if we reject \({H_0}\) whenever the confidence interval does not contain \({\mu}_0\), we will be rejecting the null hypothesis when it is true \((\mu = {\mu}_0)\) with probability \({\alpha}\).

The level of significance is the probability of rejecting the null hypothesis when it is true. So constructing a \(100 {(1 - \alpha)}\%\) confidence interval and rejecting \({H_0}\) whenever the interval does not contain \({\mu}_0\) is equivalent to conducting a two-tailed hypothesis test with \({\alpha}\) as the level of significance.

Ex: Golf company

  • For \({\alpha} =0.05\), 95% confidence interval estimate of the population mean is
    • \({\overline{x}} \pm z_{0.025} \frac{{\sigma}}{\sqrt{n}} = 297.6 \pm 1.96 \frac{12}{\sqrt{50}} = 297.6 \pm 3.3\)
    • Interval: \([294.3, 300.9]\)
    • We can conclude with 95% confidence that the mean distance for the population of golf balls is between 294.3 and 300.9 yards.
    • Because the hypothesized value for the population mean, \({\mu}_0 = 295\), is in this interval, the hypothesis testing conclusion is that the null hypothesis, \({H_0: {\mu} = 295}\), cannot be rejected.
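The equivalence can be checked in R with the golf example's numbers:

```r
# #95% confidence interval for the golf example
xbar <- 297.6; sigma <- 12; n <- 50; mu0 <- 295
me <- qnorm(0.975) * sigma / sqrt(n)  # #~3.3
ci <- c(xbar - me, xbar + me)         # #~294.3 300.9
# #mu0 lies inside the interval, so H0 cannot be rejected
mu0 >= ci[1] && mu0 <= ci[2]          # #TRUE
```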

“ForLater” - Exercises

14.9 Unknown SD

For the \({\sigma}\) unknown case, the sampling distribution of the test statistic follows the t distribution with \((n − 1)\) degrees of freedom. Refer equation (14.3)

\[t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}} \tag{14.3}\]

One-Tailed Test

  • Ex: Heathrow Airport, testing for mean rating 7 i.e. \({\mu}_0 = 7\)
    • \({H_0}: {\mu} \leq 7 \iff {H_a}: {\mu} > 7\)
    • Sample: \({\overline{x}} = 7.25, s = 1.052, n = 60\)
    • \({\alpha} = 0.05\)
    • Refer equation (14.3), \(t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}} = \frac{7.25 - 7}{1.052/\sqrt{60}} = 1.84\)
    • \(\text{DOF} = n-1 = 60 -1 = 59\)
    • Refer For P(t), find t by qt() and This is a Right Tail Test
      • \({P_{\left(t \geq 1.84\right)}} = 0.0354\) i.e. between 0.05 and 0.025
    • Comparison
      • \({(P_{\left(t \geq 1.84\right)}} = 0.035) < ({\alpha} = 0.05)\)
      • Thus, we can reject the \({H_0}\) and can accept the \({H_a}\)

Critical Value Approach

  • \((\text{DOF = 59}), \,t_{{\alpha} = 0.05} = 1.671\)
  • Because \((t = 1.84) > (t_{{\alpha} = 0.05} = 1.671)\), Reject \({H_0}\)

# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
#
# #p-value approach: Find Cumulative Probability P corresponding to the given t-value & DOF=59
pt(q = 1.84, df = 59, lower.tail = FALSE)
## [1] 0.03539999
#
# #Critical Value: t-value for which Area under the curve towards Right is alpha=0.05 & DOF=59
qt(p = 0.05, df = 59, lower.tail = FALSE)
## [1] 1.671093

Two Tailed Test

  • Ex: Holiday Toys, testing for sale of 40 units, i.e. \({\mu}_0 = 40\)
    • \({H_0}: {\mu} = 40 \iff {H_a}: {\mu} \neq 40\)
    • Sample: \({\overline{x}} = 37.4, s = 11.79, n = 25\)
    • \({\alpha} = 0.05\)
    • Refer equation (14.3), \(t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}} = \frac{37.4 - 40}{11.79/\sqrt{25}} = -1.10\)
    • \(\text{DOF} = n-1 = 25 -1 = 24\)
    • Because we have a two-tailed test, the p-value is two times the area under the curve of the t distribution for \(t \leq -1.10\)
      • \(P_{\left(t\right)} = P_{\left(t \leq -1.10\right)} + P_{\left(t \geq 1.10\right)}\)
      • \(P_{\left(t\right)} = 2 \times P_{\left(t \leq -1.10\right)}\), Because the t distribution is symmetric
      • \(P_{\left(t\right)} = 2 \times 0.1411 = 0.2822\) i.e. between 0.20 and 0.40 (twice the one-tail area, which lies between 0.10 and 0.20)
    • Comparison
      • \((P_{\left(t\right)} = 0.2822) > ({\alpha} = 0.05)\)
      • Thus, we can NOT reject the \({H_0}\)

Critical Value Approach

  • \((\text{DOF = 24})\)
  • We find that \(P_{\left(t\right)} = 0.025\) for \(-t_{\alpha/2 = 0.025} = -2.064\) and \(t_{\alpha/2 = 0.025} = 2.064\)
  • Compare test statistic with t-value
    • Because \((t = -1.10)\) is NOT lower than \((-t_{\alpha/2 = 0.025} = -2.064)\), we can NOT reject \({H_0}\)
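The Holiday Toys two-tailed t-test maps onto pt() and qt() the same way:

```r
# #Holiday Toys: two-tailed t-test with unknown sigma
xbar <- 37.4; mu0 <- 40; s <- 11.79; n <- 25
t_stat <- (xbar - mu0) / (s / sqrt(n))                          # #~ -1.10
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)  # #~0.28
t_crit <- qt(0.025, df = n - 1, lower.tail = FALSE)             # #~2.064
p_value <= 0.05                                                 # #FALSE: do not reject H0
```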

14.10 Hypothesis Testing and Decision Making

If the purpose of a hypothesis test is to make a decision when \({H_0}\) is true and a different decision when \({H_a}\) is true, the decision maker may want to, and in some cases be forced to, take action with both the conclusion do not reject \({H_0}\) and the conclusion reject \({H_0}\).

If this situation occurs, statisticians generally recommend controlling the probability of making a Type II error. With the probabilities of both the Type I and Type II error controlled, the conclusion from the hypothesis test is either to accept \({H_0}\) or reject \({H_0}\). In the first case, \({H_0}\) is concluded to be true, while in the second case, \({H_a}\) is concluded true. Thus, a decision and appropriate action can be taken when either conclusion is reached.

“ForLater” - Calculate \({\beta}\)

When the true population mean \({\mu}\) is close to the null hypothesis value of \({\mu} = 120\), the probability is high that we will make a Type II error. However, when the true population mean \({\mu}\) is far below the null hypothesis value of \({\mu} = 120\), the probability is low that we will make a Type II error.

Definition 14.23 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 − \beta\).
Definition 14.24 Power Curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.

Note that the power curve extends over the values of \({\mu}\) for which the null hypothesis is false. The height of the power curve at any value of \({\mu}\) indicates the probability of correctly rejecting \({H_0}\) when \({H_0}\) is false.
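A power curve for a lower tail test can be sketched as follows; \(\sigma\), \(n\), and \(\alpha\) below are assumed values, since the \({\mu} = 120\) example does not specify them.

```r
# #Sketch (assumed sigma, n, alpha): power curve for H0: mu >= 120
sigma <- 12; n <- 36; alpha <- 0.05
crit <- 120 + qnorm(alpha) * sigma / sqrt(n)  # #reject H0 if xbar <= crit
mu <- seq(110, 120, by = 0.5)                 # #values of mu where H0 is false
power <- pnorm(crit, mean = mu, sd = sigma / sqrt(n))  # #P(reject H0 | mu)
beta <- 1 - power                             # #Type II error probability
# plot(mu, power, type = "l")                 # #the power curve
```

As the text notes, power is high when \({\mu}\) is far below 120 and falls to \({\alpha}\) as \({\mu}\) approaches the hypothesized value.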

14.11 Summary

We can make 3 observations about the relationship among \({\alpha}\), \({\beta}\), and the sample size \({n}\).

  1. Once two of the three values are known, the other can be computed.
  2. For a given level of significance \({\alpha}\), increasing the sample size will reduce \({\beta}\).
  3. For a given sample size, decreasing \({\alpha}\) will increase \({\beta}\), whereas increasing \({\alpha}\) will decrease \({\beta}\).
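Observation 2 can be checked with stats::power.t.test(), which solves for any one of power, n, delta, or sig.level given the others; the effect size and standard deviation below are assumed for illustration.

```r
# #Assumed effect size and sd: beta = 1 - power shrinks as n grows
p30  <- power.t.test(n = 30,  delta = 5, sd = 10, sig.level = 0.05)$power
p100 <- power.t.test(n = 100, delta = 5, sd = 10, sig.level = 0.05)$power
c(beta_n30 = 1 - p30, beta_n100 = 1 - p100)  # #beta falls with larger n
```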

Validation


15 Two Populations

15.1 Overview

  • “Inference About Means and Proportions with Two Populations”
    • “ForLater”

15.2 Summary

How interval estimates and hypothesis tests can be developed for situations involving two populations when the difference between the two population means or the two population proportions is of prime importance.

Example

  • To develop an interval estimate of the difference between the mean starting salary for a population of men and the mean starting salary for a population of women.

  • To conduct a hypothesis test to determine whether any difference is present between the proportion of defective parts in a population of parts produced by supplier A and the proportion of defective parts in a population of parts produced by supplier B.

  • The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.

Validation


16 Variance

16.1 Overview

  • “Inferences About Population Variances”
    • “ForLater” - Hypothesis Testing, Inferences About Two Population Variances

16.2 Inferences About a Population Variance

In many manufacturing applications, controlling the process variance is extremely important in maintaining quality.

The sample variance \({s^2}\), given by equation (8.11), is the point estimator of the population variance \({\sigma}^2\).

\[s^2 = \frac{\sum {(x_i - {\overline{x}})}^2}{n-1} \tag{8.11}\]

Definition 16.1 Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n − 1}\) degrees of freedom.

Note:

  • The chi-square distribution is based on sampling from a normal population.
  • It can be used to develop interval estimates and conduct hypothesis tests about a population variance.
  • The notation \({\chi_{\alpha}^2}\) denotes the value for the chi-square distribution that provides an area or probability of \({\alpha}\) to the right of the \({\chi_{\alpha}^2}\) value.

Interval Estimation

Example: A sample of 20 containers \((n = 20)\) has the sample variance \({s^2} = 0.0025\)

  • DOF = 19
  • For (DOF = 19), \({\chi_{\alpha = 0.025}^2} = 32.852\), indicating that 2.5% of the chi-square values are to the right of 32.852, and \({\chi_{\alpha = 0.975}^2} = 8.907\) indicating that 97.5% of the chi-square values are to the right of 8.907.
  • For (DOF = 19), 95% of the chi-square values are between \({\chi_{\alpha = 0.975}^2}\) and \({\chi_{\alpha = 0.025}^2}\)
    • There is a .95 probability of obtaining a \({\chi^2}\) value such that \({\chi_{0.975}^2} \leq {\chi^2} \leq{\chi_{0.025}^2}\)
# #pnorm() qnorm() | pt() qt() | pchisq() qchisq()
#
# #p-value approach: Find Cumulative Probability P corresponding to the given ChiSq & DOF=19
pchisq(q = 32.852, df = 19, lower.tail = FALSE)
## [1] 0.02500216
#
# #ChiSq value for which Area under the curve towards Right is alpha=0.025 & DOF=19 #32.852
qchisq(p = 0.025, df = 19, lower.tail = FALSE)
## [1] 32.85233
  • Rearranging the probability statement above, we get (16.1), which provides a 95% confidence interval estimate for the population variance \({\sigma}^2\).

\[\frac{(n-1)s^2}{{\chi_{0.025}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{0.975}^2}} \tag{16.1}\]

  • Using values from the example, \(0.0014 \leq {\sigma}^2 \leq 0.0053 \Rightarrow 0.0380 \leq {\sigma} \leq 0.0730\)
    • which gives the 95% confidence interval for the population standard deviation
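The interval above can be reproduced with `qchisq()`; a minimal sketch using the example's values (n = 20, s² = 0.0025):

```r
# #95% CI for the population variance (container example: n = 20, s^2 = 0.0025)
n  <- 20
s2 <- 0.0025
alpha <- 0.05
ci_var <- (n - 1) * s2 / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)
ci_var        # 0.0014 <= sigma^2 <= 0.0053
sqrt(ci_var)  # 0.0380 <= sigma   <= 0.0730
```

Note that the larger chi-square value (\(\chi_{0.025}^2\)) goes in the denominator of the lower limit, which is why the bounds come out in increasing order.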

Generalising the equation (16.1), the equation (16.2) is the interval estimate of a population variance.

\[\frac{(n-1)s^2}{{\chi_{{\alpha}/2}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{1-{\alpha}/2}^2}} \tag{16.2}\]

where the \({\chi^2}\) values are based on a chi-square distribution with \({n-1}\) degrees of freedom and where \((1 − {\alpha})\) is the confidence coefficient.

“ForLater”

  • Hypothesis Testing
  • Inferences About Two Population Variances
Definition 16.2 The F distribution is based on sampling from two normal populations.

Validation


17 Independence

17.1 Overview

  • “Comparing Multiple Proportions, Test of Independence and Goodness of Fit”
    • “ForLater” - Everything

17.2 Summary

This chapter covers hypothesis-testing procedures that expand our capacity for making statistical inferences about populations.

  • The test statistic used in conducting the hypothesis tests in this chapter is based on the chi-square \({\chi^2}\) distribution.
  • In all cases, the data are categorical.
  • Applications
    • Testing the equality of population proportions for three or more populations
    • Testing the independence of two categorical variables
    • Testing whether a probability distribution for a population follows a specific historical or theoretical probability distribution

All tests apply to categorical variables and all tests use a chi-square \({\chi^2}\) test statistic that is based on the differences between observed frequencies and expected frequencies. In each case, expected frequencies are computed under the assumption that the null hypothesis is true. These chi-square tests are upper tailed tests. Large differences between observed and expected frequencies provide a large value for the chi-square test statistic and indicate that the null hypothesis should be rejected.

The test for the equality of population proportions for three or more populations is based on independent random samples selected from each of the populations. The sample data show the counts for each of two categorical responses for each population. The null hypothesis is that the population proportions are equal. Rejection of the null hypothesis supports the conclusion that the population proportions are not all equal.

The test of independence between two categorical variables uses one sample from a population with the data showing the counts for each combination of two categorical variables. The null hypothesis is that the two variables are independent and the test is referred to as a test of independence. If the null hypothesis is rejected, there is statistical evidence of an association or dependency between the two variables.
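As a sketch of the test of independence, `chisq.test()` applied to a small made-up 2×3 table of counts (the data are hypothetical, chosen only for illustration):

```r
# #Hypothetical 2x3 crosstabulation of counts for two categorical variables
tab <- matrix(c(30, 20, 25,
                20, 30, 25), nrow = 2, byrow = TRUE)
res <- chisq.test(tab)   # H0: the two variables are independent
res$expected             # expected frequencies under H0 (all 25 here)
res$statistic            # chi-square test statistic
res$p.value              # upper-tail p-value
```

A large statistic (small p-value) would lead to rejecting independence; here the evidence is weak at the 0.05 level.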

The goodness of fit test is used to test the hypothesis that a population has a specific historical or theoretical probability distribution. We showed applications for populations with a multinomial probability distribution and with a normal probability distribution. Since the normal probability distribution applies to continuous data, intervals of data values were established to create the categories for the categorical variable required for the goodness of fit test.
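A goodness of fit sketch with `chisq.test()`, using hypothetical observed counts tested against assumed multinomial proportions:

```r
# #Hypothetical observed counts vs. hypothesised multinomial proportions
observed <- c(40, 35, 25)
p0 <- c(0.4, 0.4, 0.2)            # H0: population follows these proportions
res <- chisq.test(observed, p = p0)
res$expected                       # expected frequencies: 40 40 20
res$statistic; res$p.value
```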

Validation


18 ANOVA

18.1 Overview

  • “Experimental Design and Analysis of Variance”
    • “ForLater” - Everything

18.2 Summary

Analysis of variance (ANOVA) can be used to test for differences among means of several populations or treatments.

The completely randomized design and the randomized block design are used to draw conclusions about differences in the means of a single factor. The primary purpose of blocking in the randomized block design is to remove extraneous sources of variation from the error term. Such blocking provides a better estimate of the true error variance and a better test to determine whether the population or treatment means of the factor differ significantly.

The basis for the statistical tests used in analysis of variance and experimental design is the development of two independent estimates of the population variance \({\sigma}^2\). In the single-factor case, one estimator is based on the variation between the treatments; this estimator provides an unbiased estimate of \({\sigma}^2\) only if the means \(\{{\mu}_1, {\mu}_2, \ldots, {\mu}_k\}\) are all equal. A second estimator of \({\sigma}^2\) is based on the variation of the observations within each sample; this estimator will always provide an unbiased estimate of \({\sigma}^2\).

By computing the ratio of these two estimators (the F statistic), it is determined whether to reject the null hypothesis that the population or treatment means are equal.

In all the experimental designs considered, the partitioning of the sum of squares and degrees of freedom into their various sources enabled us to compute the appropriate values for the analysis of variance calculations and tests.

Further, Fisher's LSD procedure and the Bonferroni adjustment can be used to perform pairwise comparisons to determine which means are different.
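A one-way ANOVA sketch on made-up data, followed by Bonferroni-adjusted pairwise comparisons via `pairwise.t.test()` (group means, sample sizes, and seed are arbitrary):

```r
# #One-way ANOVA sketch: three hypothetical treatment groups of 10 observations
set.seed(1)
dat <- data.frame(
  y   = c(rnorm(10, mean = 5), rnorm(10, mean = 5.5), rnorm(10, mean = 7)),
  grp = factor(rep(c("A", "B", "C"), each = 10))
)
fit <- aov(y ~ grp, data = dat)
summary(fit)   # F = MSTR/MSE with df = (k-1, n-k) = (2, 27)
# #Which means differ? Pairwise t-tests with Bonferroni adjustment
pairwise.t.test(dat$y, dat$grp, p.adjust.method = "bonferroni")
```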

Validation


19 Simple Linear Regression

19.1 Overview

    • “ForLater” - Everything

Simple Linear Regression Model

Definition 19.1 Regression analysis can be used to develop an equation showing how two or more variables are related.
Definition 19.2 The variable being predicted is called the dependent variable \(({y})\).
Definition 19.3 The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).
Definition 19.4 The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.
Definition 19.5 The equation that describes how \({y}\) is related to \({x}\) and an error term is called the regression model. For example, the simple linear regression model is given by equation (19.1)

\[{y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon} \tag{19.1}\]

Note

  • \({\beta}_0\) and \({\beta}_1\) are referred to as the parameters of the model
  • The random variable, error term \(({\epsilon})\), accounts for the variability in \({y}\) that cannot be explained by the linear relationship between \({x}\) and \({y}\).
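A minimal `lm()` sketch of the model in equation (19.1), using the built-in `cars` data with `dist` as the dependent variable \({y}\) and `speed` as the independent variable \({x}\):

```r
# #Simple linear regression: estimate beta0 and beta1 from sample data
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # b0 (intercept) approx. -17.58, b1 (slope) approx. 3.93
```

The fitted coefficients `b0` and `b1` are the point estimates of the parameters \({\beta}_0\) and \({\beta}_1\); the residuals estimate the error term \({\epsilon}\).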

19.2 Summary

Validation


20 Multiple Regression

20.1 Overview

    • “ForLater” - Everything

20.2 Summary

Multiple regression analysis enables us to understand how a dependent variable is related to two or more independent variables.
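A sketch with `lm()` and the built-in `mtcars` data, using two assumed predictors (`wt`, `hp`) for the dependent variable `mpg`:

```r
# #Multiple regression: one dependent variable, two independent variables
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)               # intercept and one slope per predictor
summary(fit)$r.squared  # share of variability in mpg explained by wt and hp
```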

Validation


21 Regression Models

21.1 Overview

    • “ForLater” - Everything

21.2 Summary

Validation


22 Time Series

22.1 Overview

  • “Time Series Analysis and Forecasting”
    • “ForLater” - Everything

22.2 Summary

Validation


23 Nonparametric Methods

23.1 Overview

23.2 Parametric Methods

Definition 23.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

Parametric methods mostly require quantitative data. However, they are sometimes more powerful than nonparametric methods.

  • The reason that parametric tests are sometimes more powerful than randomisation and tests based on ranks is that the parametric tests make use of some extra information about the data: the nature of the distribution from which the data are assumed to have come.
  • Powerful here means, they require smaller sample size.
  • However, their power advantage is not invariant
  • Further, rarely, if ever, do a parametric test and a nonparametric test actually have the same null hypothesis.
    • The parametric t-test is testing the mean of the distribution, assuming the first two moments exist.
    • The Wilcoxon rank sum test does not assume any moments, and tests equality of distributions instead.
    • The two tests are testing different hypotheses (comparable in a limited sense but different).
  • At large sample sizes, either of the parametric or the nonparametric tests work adequately.
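The contrast between the two nulls can be sketched by running both tests on the same small made-up samples:

```r
# #Parametric vs. nonparametric test on the same hypothetical data
x <- c(1.1, 2.3, 2.9, 3.8, 4.5)
y <- c(2.0, 3.1, 4.2, 5.0, 6.1)
t.test(x, y)$p.value        # t-test: compares means, assumes moments exist
wilcox.test(x, y)$p.value   # Wilcoxon rank sum: compares distributions via ranks
```

The two p-values generally differ because, as noted above, the tests address different hypotheses.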

23.3 Nonparametric Methods

Definition 23.2 Distribution-free methods are the Statistical methods that make no assumption about the probability distribution of the population.
Definition 23.3 Nonparametric methods are the statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution free methods. Several of the methods can be applied with categorical as well as quantitative data.

Most of the statistical methods referred to as parametric methods require quantitative data, while nonparametric methods allow inferences based on either categorical or quantitative data.

  • However, the computations used in the nonparametric methods are generally done with categorical data.
    • Nominal or ordinal measures in many cases require a nonparametric test.
  • Whenever the data are quantitative, we will transform the data into categorical data in order to conduct the nonparametric test.
  • Most nonparametric tests use some way of ranking the measurements.
  • Nonparametric tests are used in cases where parametric tests are not appropriate.
    • Nonparametric tests are often necessary, especially when the distribution is skewed rather than normal, the distribution is not known, or the sample size is too small (\(n < 30\)) to assume a normal distribution.
    • Also, if there are extreme values or values that are clearly “out of range” nonparametric tests should be used.

23.4 Summary

Validation


24 Quality Control

24.1 Overview

  • “Statistical Methods for Quality Control”
    • “ForLater” - Everything

24.2 Summary

Validation


25 Index Numbers

25.1 Overview

    • “ForLater” - Everything

25.2 Summary

Validation


References

David R. Anderson, Thomas A. Williams, Dennis J. Sweeney. 2018. Statistics for Business and Economics. Revised 13e. Boston, MA 02210 USA: Cengage Learning. https://www.cengage.com.

Glossary

THEOREMS

DEFINITIONS

1.1: Vectors

Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.

1.2: Components

Members of a vector are called components.

1.3: Packages

Packages are the fundamental units of reproducible R code.

2.1: R-Markdown

R Markdown is a file format for making dynamic documents with R.

2.2: NA

NA is a logical constant of length 1 which contains a missing value indicator.

2.3: Factors

Factors are the data objects which are used to categorize the data and store it as levels.

2.4: Lists

Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.

2.5: DataFrame

Data Frames are lists with the restriction that all elements of a data frame are of equal length.

6.1: Data

Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.

6.2: Elements

Elements are the entities on which data are collected. (Generally ROWS)

6.3: Variable

A variable is a characteristic of interest for the elements. (Generally COLUMNS)

6.4: Observation

The set of measurements obtained for a particular element is called an observation.

6.5: Statistics

Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

6.6: Scale-of-Measurement

The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.

6.7: Nominal-Scale

When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.

6.8: Ordinal-Scale

The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.

6.9: Interval-Scale

The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.

6.10: Ratio-Scale

The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.

6.11: Categorical-Data

Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.

6.12: Quantitative-Data

Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.

6.13: Discrete

Quantitative data that measure ‘how many’ are discrete.

6.14: Continuous

Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.

6.15: Cross-Sectional-Data

Cross-sectional data are data collected at the same or approximately the same point in time.

6.16: Time-Series-Data

Time-series data are data collected over several time periods.

6.17: Observational-Study

In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.

6.18: Experiment

The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.

6.19: Descriptive-Statistics

Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.

6.20: Population

A population is the set of all elements of interest in a particular study.

6.21: Sample

A sample is a subset of the population.

6.22: Parameter-vs-Statistic

The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

6.23: Census

The process of conducting a survey to collect data for the entire population is called a census.

6.24: Sample-Survey

The process of conducting a survey to collect data for a sample is called a sample survey.

6.25: Statistical-Inference

Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

6.26: Analytics

Analytics is the scientific process of transforming data into insight for making better decisions.

6.27: Descriptive-Analytics

Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.

6.28: Predictive-Analytics

Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.

6.29: Prescriptive-Analytics

Prescriptive analytics is the set of analytical techniques that yield a best course of action.

6.30: Big-Data

Larger and more complex data sets are now often referred to as big data.

6.31: Data-Mining

Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.

7.1: Frequency-Distribution

A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

7.2: Cross-Tab

A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, one of the variables is categorical.

8.1: Number

A number is a mathematical object used to count, measure, and label. Their study or usage is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.

8.2: Prime

A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither a Prime nor a composite, it is a ‘Unit.’ Thus, by definition, Negative Integers and Zero cannot be Prime.

8.3: Parity-Odd-Even

Parity is the property of an integer \(\mathbb{Z}\) of whether it is even or odd. It is even if the integer is divisible by 2 with no remainders left and it is odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.

8.4: Positive-Negative

An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.

8.5: Mersenne-Primes

Mersenne primes are those prime number that are of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)

8.6: Mean

Given a data set \({X=\{x_1,x_2,\ldots,x_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({x_1,x_2,\ldots,x_n}\) divided by the count \({n}\).

8.7: Median

Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.

8.8: Geometric-Mean

The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.

8.9: Mode

The mode is the value that occurs with greatest frequency.

8.10: Percentile

A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 – p)% of the observations are greater than the \(p^{th}\) percentile.

8.11: Variance

The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.

8.12: Standard-Deviation

The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.

8.13: Skewness

Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

8.14: Tails

A tail refers to the tapering sides at either end of a distribution curve.

8.15: Kurtosis

Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal {N}}_{(\mu,\, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).

8.16: TheSample

A sample of \({n}\) observations given by \({X=\{x_1,x_2,\ldots,x_n\}}\) have a sample mean \({\overline{x}}\) and the sample standard deviation, \({s}\).

8.17: z-Scores

The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.

8.18: t-statistic

Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.
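A sketch of standardized values computed from the sample mean and sample standard deviation, as in 8.18 (the data are made up):

```r
# #Standardize a small hypothetical sample using xbar and s
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
z <- (x - mean(x)) / sd(x)   # number of sample sd's each x_i is from xbar
round(z, 2)
```

By construction the standardized values have mean 0 and standard deviation 1.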

8.19: Chebyshev-Theorem

Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).

8.20: Empirical-Rule

Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are approximately 68.27%, 95.45%, and 99.73%, respectively.
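A sketch comparing the empirical-rule probabilities (exact for the normal, via `pnorm()`) with the Chebyshev lower bound \(1 - 1/k^2\), which holds for any distribution (and is informative only for \(k > 1\)):

```r
# #Probability within k standard deviations of the mean
k <- 1:3
empirical <- pnorm(k) - pnorm(-k)   # normal distribution: exact
chebyshev <- 1 - 1 / k^2            # any distribution: lower bound
round(empirical, 4)   # 0.6827 0.9545 0.9973
round(chebyshev, 4)   # 0.0000 0.7500 0.8889
```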

8.21: Outliers

Sometimes unusually large or unusually small values are called outliers. It is a data point that differs significantly from other observations.

8.22: Covariance

Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.

8.23: Correlation-Coefficient

Correlation coefficient is a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.

9.1: Probability

Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.

9.2: Random-Experiment

A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.

9.3: Sample-Space

The sample space for a random experiment is the set of all experimental outcomes.

9.4: Counting-Rule

Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \(\{(n_1)(n_2) \cdots (n_k) \}\)

9.5: Tree-Diagram

A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.

9.6: Factorial

The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to n. The value of 0! is 1 i.e. \(0!=1\)

9.7: Combinations

Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)

9.8: Permutations

Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)
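These counting rules map directly onto `factorial()` and `choose()`; a small sketch (the values of \({N}\) and \({k}\) are arbitrary):

```r
# #Counting rules: factorial, combinations, permutations
factorial(5)                  # 5! = 120
choose(10, 3)                 # C(10,3): combinations, order irrelevant
choose(10, 3) * factorial(3)  # P(10,3): permutations, order matters
```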

9.9: Event

An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \({S}\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)

9.10: Complement

Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)

9.11: Union

Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)

9.12: Intersection

Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)

9.13: Mutually-Exclusive

Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \emptyset\) and \(P(A \cap B) = 0\)

9.14: Conditional-Probability

Conditional probability is the probability of an event given that another event has already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)

9.15: Events-Independent

Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)

10.1: Random-Variable

A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’

10.2: Discrete-Random-Variable

A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. This includes factor-type coding, e.g. Male as 0 and Female as 1.

10.3: Continuous-Random-Variable

A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

10.4: Probability-Distribution

The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.

10.5: Probability-Function

For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.

10.6: Expected-Value-Discrete

The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)

10.7: Variance-Discrete

The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)
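A sketch of \(E(x)\) and \(\text{Var}(x)\) for a hypothetical discrete probability function (the values and probabilities are made up):

```r
# #Expected value and variance of a hypothetical discrete random variable
x  <- 0:3
fx <- c(0.1, 0.3, 0.4, 0.2)    # probability function f(x); sums to 1
mu <- sum(x * fx)              # E(x)   = sum of x*f(x)   = 1.7
v  <- sum((x - mu)^2 * fx)     # Var(x) = weighted squared deviations = 0.81
c(mean = mu, var = v, sd = sqrt(v))
```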

10.8: Bivariate

A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.

11.1: Uniform-Probability-Distribution

Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.

11.2: Probability-Density-Function

The probability that the continuous random variable \({x}\) takes a value between \([a, b]\) is given by the area under the graph of probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1, however its integral must be equal to 1.
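Both points can be sketched numerically with `integrate()`: a density may exceed 1 pointwise, yet its total area is 1, and probabilities are areas under \(f(x)\):

```r
# #A density can exceed 1 pointwise ...
dnorm(0, mean = 0, sd = 0.1)                           # about 3.99 > 1
# #... but its total area must equal 1
integrate(dnorm, -Inf, Inf, mean = 0, sd = 0.1)$value  # 1
# #P(0 <= x <= 1) for x ~ Uniform(0, 2) is the area over [0, 1]
integrate(function(x) dunif(x, min = 0, max = 2), 0, 1)$value  # 0.5
```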

11.3: Normal-Distribution

A normal distribution (\({\mathcal {N}}_{(\mu,\, \sigma^2)}\)) is a type of continuous probability distribution for a real-valued random variable.

11.4: Standard-Normal

A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal {z}}_{({\mu} = 0,\, {\sigma} = 1)}\)

12.1: Sampled-Population

The sampled population is the population from which the sample is drawn.

12.2: Frame

Frame is a list of the elements that the sample will be selected from.

12.3: Target-Population

The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the ‘Sampled-Population,’ but it may also differ.

12.4: SRS

A simple random sample (SRS) is a set of \({k}\) objects in a population of \({N}\) objects where all possible samples are equally likely to happen. The number of such different simple random samples is \(C_k^N\)
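A sketch drawing one SRS with `sample()`; the population size, sample size, and seed are arbitrary:

```r
# #One simple random sample of k = 10 from a population of N = 100
set.seed(42)
s <- sample(1:100, 10)   # without replacement by default
s
choose(100, 10)          # number of distinct possible SRSs, C(100,10)
```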

12.5: Sampling-without-Replacement

Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.

12.6: Sampling-with-Replacement

Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.

12.7: Random-Sample

A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.

12.8: Proportion

A population proportion \({P}\), is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample and \({n}\) is the size of the sample obtained from the population.

12.9: Point-Estimation

To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.

12.10: Point-Estimator

A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), and \(\rho_{xy}\) (correlation)

12.11: Point-Estimate

The numerical value obtained for the sample statistic is called the point estimate. ‘Estimate’ is used for the sample value only; the corresponding population value is a parameter. An estimate is a value, while an estimator is a function.

12.12: Sampling-Distribution

The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

12.13: Standard-Error

In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\).

12.14: Sampling-Error

A sampling error is the difference between a population parameter and a sample statistic.

12.15: Central-Limit-Theorem

Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
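A simulation sketch of the CLT using a uniform (non-normal) parent population; the replication count and seed are arbitrary:

```r
# #Sampling distribution of xbar for samples of n = 30 from Uniform(0, 1)
set.seed(7)
xbar <- replicate(5000, mean(runif(30)))
mean(xbar)   # close to the population mean, 0.5
sd(xbar)     # close to sigma/sqrt(n) = sqrt(1/12)/sqrt(30), about 0.0527
hist(xbar)   # approximately bell-shaped despite the uniform parent
```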

13.1: Interval-Estimate

Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{Margin of Error}\)

13.2: Confidence-Interval

Confidence interval is another name for an interval estimate. It is normally stated at a \((1 - \alpha)\) level, e.g. a 95% confidence interval.

13.3: Confidence-Coefficient

The confidence level expressed as a decimal value is the confidence coefficient (\(1-{\alpha}\)). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

13.4: t-distribution

When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.

13.5: Degrees-of-Freedom

The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, when one parameter is estimated from a sample of \(n\) values, the degrees of freedom are \((n - 1)\).

14.1: Hypothesis-Testing

Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

14.2: Hypothesis-Null

Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.

14.3: Hypothesis-Alternative

Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

14.4: Hypothesis-1T-Lower-Tail

\(\text{\{Left Tail or Lower Tail\} } {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

14.5: Hypothesis-1T-Upper-Tail

\(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

14.6: Hypothesis-2T-Two-Tail

\(\text{\{Two Tail\} } {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

14.7: Error-Type-I

The error of rejecting \({H_0}\) when it is true is a Type I error \(({\alpha})\).

14.8: Error-Type-II

The error of accepting \({H_0}\) when it is false is a Type II error \(({\beta})\).

14.9: Level-of-Significance

The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

14.10: Significance-Tests

Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

14.11: Test-Statistic

Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

14.12: Tailed-Test

A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

14.13: One-Tailed-Test

One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.

14.14: Approach-p-value

The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.

14.15: p-value

A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

14.16: Approach-Critical-Value

The critical value approach requires that we first determine a value for the test statistic called the critical value.

14.17: Critical-Value

Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).
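The critical values quoted above come straight from the standard normal quantile function:

```r
alpha <- 0.05
qnorm(1 - alpha)     # upper-tail critical value, approx 1.645
qnorm(1 - alpha/2)   # two-tailed critical value, approx 1.96
qnorm(alpha)         # lower-tail critical value, approx -1.645
```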

14.18: Acceptance-Region

An acceptance region (confidence interval) is the set of values of the test statistic for which the null hypothesis is accepted, i.e. if the observed test statistic falls in the confidence interval, we accept the null hypothesis and reject the alternative hypothesis.

14.19: Margin-Error

The margin of error tells how far the true population mean might be from the sample mean. It is given by \(Z\frac{\sigma}{\sqrt{n}}\).
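For example, with hypothetical values \(\sigma = 10\), \(n = 64\), and a 95% confidence level:

```r
sigma <- 10; n <- 64; alpha <- 0.05    # hypothetical inputs
qnorm(1 - alpha/2) * sigma / sqrt(n)   # margin of error, approx 2.45
```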

14.20: Rejection-Region

A rejection region (critical region) is the set of values of the test statistic for which the null hypothesis is rejected, i.e. if the observed test statistic falls in the critical region, we reject the null hypothesis and accept the alternative hypothesis.

14.21: Two-Tailed-Test

Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.

14.22: Approach-p-value-Steps

p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret
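The steps above can be sketched with a one-sample t test on hypothetical data (\(H_0: \mu = 12\), two-tailed):

```r
x <- c(12.1, 11.8, 12.5, 12.0, 11.6, 12.3)   # hypothetical sample
alpha <- 0.05                                # step: specify alpha
tt <- t.test(x, mu = 12)                     # step: calculate test statistic
tt$statistic                                 # the t test statistic
tt$p.value                                   # step: calculate p-value
tt$p.value < alpha                           # step: compare; TRUE => reject H0
```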

14.23: Power

The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
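Base R's power.t.test() computes this probability for a t test; the inputs below are illustrative:

```r
# Power of a one-sample t test for an effect of half a standard deviation
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample")$power
```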

14.24: Power-Curve

Power Curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.

16.1: Distribution-Chi-Square

Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n - 1}\) degrees of freedom.
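A simulation sketch of this result (seed and parameters are arbitrary):

```r
set.seed(1)                 # arbitrary seed
n <- 10; sigma <- 2         # illustrative parameters
stat <- replicate(10000, (n - 1) * var(rnorm(n, sd = sigma)) / sigma^2)
hist(stat, freq = FALSE, main = "(n-1)s^2 / sigma^2", xlab = "statistic")
curve(dchisq(x, df = n - 1), add = TRUE)   # chi-square density, df = n - 1
```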

16.2: Distribution-F

The F distribution is based on sampling from two normal populations.

19.1: Regression-Analysis

Regression analysis can be used to develop an equation showing how two or more variables are related.

19.2: Variable-Dependent

The variable being predicted is called the dependent variable \(({y})\).

19.3: Variable-Independent

The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).

19.4: Simple-Linear-Regression

The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.

19.5: Regression-Model

The equation that describes how \({y}\) is related to \({x}\) and an error term is called the regression model. For example, the simple linear regression model is given by equation (19.1).
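A minimal fit of such a model with lm(), using the built-in cars data set (speed vs. stopping distance):

```r
fit <- lm(dist ~ speed, data = cars)   # y = b0 + b1*x + error
coef(fit)                              # b0 (intercept) and b1 (slope)
summary(fit)$r.squared                 # share of variation in y explained by x
```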

23.1: Parametric-Methods

Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

23.2: Distribution-free-Methods

Distribution-free methods are statistical methods that make no assumption about the probability distribution of the population.

23.3: Nonparametric-Methods

Nonparametric methods are the statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution free methods. Several of the methods can be applied with categorical as well as quantitative data.
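For example, wilcox.test() provides a distribution-free alternative to the one-sample t test (the data below are hypothetical):

```r
x <- c(12.1, 11.7, 12.6, 12.2, 11.5, 12.4)   # hypothetical sample
wilcox.test(x, mu = 12)$p.value              # no normality assumption needed
```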

ERRORS

1.1: cannot-open-connection

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection

1.2: need-finite-xlim

Error in plot.window(...) : need finite 'xlim' values

1.3: par-old-par

Error in par(old.par) : invalid value specified for graphical parameter "pin"

2.1: plot-finite-xlim

Error in plot.window(...) : need finite 'xlim' values

2.2: Function-Not-Found

Error in arrange(bb, day) : could not find function "arrange"

3.1: Object-Not-Found-01

Error in match.arg(method) : object 'day' not found

3.2: Comparison-possible

Error in day == 1 : comparison (1) is possible only for atomic and list types

3.3: UseMethod-No-applicable-method

Error in UseMethod("select") : no applicable method for 'select' applied to an object of class "function"

3.4: Object-Not-Found-02

Error: Problem with mutate() column ... column object 'arr_delay' not found

7.1: gg-stat-count-geom-bar

Error: stat_count() can only have an x or y aesthetic.

11.1: ggplot-list

Error in is.finite(x) : default method not implemented for type 'list'

11.2: ggplot-data

Error: Must subset the data pronoun with a string.